10] correspond to these. So we need to renumber them sequentially and change the in-text citations accordingly? Wait, the rule says "Ensure the References section at the very bottom uses sequential numbered lists". It doesn't explicitly say to change the in-text citations. However, to maintain consistency, the in-text references should match the list numbers. The current in-text citations use numbers that correspond to the IDs (ref-1, ref-2, etc.) which are already sequential? Let's check: The references are ordered from ref-1 to ref-60, but they are not in numerical order in the text? Actually, the list appears to be in order of appearance? Let's see the order in the text: first reference is [10], then [9], [18], etc. The list at the bottom is not in the order they appear; it seems to be in some order (maybe alphabetical by first author?). Actually, looking at the references, they are not sorted by order of appearance. The in-text citations use numbers like [10], [9], etc., and the reference list has IDs that match those numbers. But the list itself is ordered by the ID numbers (ref-1 to ref-60 are sequential). However, the in-text citations do not follow the order of appearance; they jump around. That's fine as long as the list is numbered sequentially. The current reference list has each entry preceded by an anchor tag and then the text. To make it a sequential numbered list, we need to add the number and a period at the beginning of each entry, like "1. [text]". Also, the anchor tags can remain or be adjusted. The rule doesn't say to remove anchor tags, just to ensure numbering. So we will convert each reference to a numbered list item, maintaining the anchor ID for linking.
Second rule: Virus Linking. Ensure all mentions of specific animal viruses in the body text are hyperlinked to their actual articles using markdown format. Only link the first occurrence of each virus. We have a registry of virus names and their paths. We need to go through the entire body text, find the first occurrence of each virus name from the registry, and replace it with a markdown link. The registry includes many viruses, but we only need to link those that appear in the text. We must be careful to match exact names. Some names are long and recur, e.g., "Avian Influenza Virus" appears many times; only its first occurrence should be linked. Could some viruses appear with different casing? The text uses proper case like "Avian Influenza Virus". There are also phrases like "Influenza A Virus in Cats", whereas the registry has "Feline Influenza A Virus". We need to match exactly as listed. Note: The registry includes both general and specific entries, for example "Avian Influenza Virus" and "Avian Influenza In Wild Birds". We should link the exact match. What about "SARS-CoV-2"? It is not listed in the registry, and neither is "COVID-19"; the registry is for animal viruses, not human ones. The article mentions SARS-CoV-2 multiple times, but since it's not in the registry, we don't link it. "Dengue virus" also appears. The registry has no Dengue entry (the related "Tembusu Virus" is there, but that is a different virus). The article mentions "Dengue virus" in one place when referencing a study, not as a veterinary outbreak, so we'll not link it. Similarly, "Mannheimia haemolytica" is a bacterium, not a virus, so ignore it.
We need to be careful: The registry is large, so we'll systematically scan the text for matches. Since the article is long, we'll do it programmatically in our mind. We'll also note that some virus names appear within references but the rule says "body text" - probably not including references? The rule says "all mentions of specific animal viruses in the body text". The references section is not body text, so we don't need to link there. However, the references contain virus names in the titles, but we are not required to link them. We'll focus on the main text from "## Overview and Principles..." through to the end of the challenges section before references. Also the frontmatter doesn't contain viruses.
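The scan described above could also be done mechanically instead of by eye. The following is a minimal sketch, not the tooling actually used here: it assumes the reference list starts at a `## References` heading (the heading text is an assumption) and uses hypothetical helper names `split_body` and `link_first`. Only the body part is rewritten, so virus names inside reference titles stay unlinked.

```python
import re

def split_body(markdown: str, heading: str = "## References") -> tuple[str, str]:
    """Split the article into body text and the references tail.

    The heading text is an assumption; adjust it to the real article.
    """
    idx = markdown.find(heading)
    if idx == -1:
        return markdown, ""
    return markdown[:idx], markdown[idx:]

def link_first(body: str, name: str, path: str) -> str:
    """Wrap the first exact mention of `name` in a markdown link."""
    # The lookarounds skip partial-word hits and mentions already in brackets.
    pattern = re.compile(r"(?<![\w\[])" + re.escape(name) + r"(?![\w\]])")
    return pattern.sub(lambda m: f"[{m.group(0)}]({path})", body, count=1)

doc = (
    "West Nile Virus in Horses spreads. West Nile Virus in Horses again.\n"
    "## References\n"
    "1. West Nile Virus in Horses study."
)
body, refs = split_body(doc)
linked = link_first(
    body,
    "West Nile Virus in Horses",
    "/knowledge/viruses/livestock-viruses/west-nile-virus-in-horses",
) + refs
```

Matching is case-sensitive on purpose, since the notes call for exact registry names.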
Third rule: No Em/En Dashes. Replace any em dashes (U+2014) or en dashes (U+2013) with standard hyphens (-) or commas. In the text, do we see many em dashes? Actually, reviewing the original article, I see standard hyphens used, but there might be some en dashes in ranges? For example, in the references, there are ranges like "2000;69:531-569" but that's in the reference list, not body text. The rule says "Replace any AI em dashes (U+2014) and en dashes (U+2013) with standard hyphens (-) or standard punctuation". In the body text, I don't see any explicit em or en dashes; they appear to be hyphens already. However, there might be some in the references, but are we required to modify references? The rule says "For example, '2000;69:531–569' should become '2000;69:531-569'". That is a reference example. So we need to check the references section for en dashes and replace them. In the provided references, do I see en dashes in the DOI numbers? Actually, the DOIs use hyphens. But in the citation text, there might be ranges like "pp13-28" already with hyphens. I'll check: In the text, there is "3.2-fold" with a hyphen. No en dashes visible. However, in the original content, there might be some hidden. The instruction says "AI em dashes", so the text was likely generated with em dashes, but in the provided text they appear as hyphens, perhaps because the platform converted them? Let's examine the original: In the first paragraph, I see "proactive, data-driven prediction and prevention." No dashes. Later, "supervised learning, unsupervised learning, and reinforcement learning" uses commas. There is a section with "500+ single-stranded RNA viruses", which uses a plus sign. No em dashes. The references: "pp13-28" is a hyphen. I think the text already uses hyphens. However, there is one place: "The performance metrics must meet a threshold of operational acceptability, which is often higher than what is reported in academic literature." No dash.
I'll assume we don't need to change anything for dashes, but we'll scan quickly for any en dash in the references: In reference [15]: "JMIR Formative Research. 2022." Not there. [16]: "bioRxiv. 2020." No. [19]: "Indian Journal of Science and Technology. 2024." No. I don't see any. So maybe no changes needed.
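Because em and en dashes are easy to miss visually (and may already have been converted by the platform), a code-point check is more reliable than reading. A small sketch, with hypothetical helper names, that reports the position of every U+2014 or U+2013 and replaces them with hyphens; the rule also allows commas, which this sketch does not attempt:

```python
EM_DASH = "\u2014"
EN_DASH = "\u2013"

def find_dashes(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, kind) for each em or en dash, both 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == EM_DASH:
                hits.append((line_no, col, "EM DASH"))
            elif ch == EN_DASH:
                hits.append((line_no, col, "EN DASH"))
    return hits

def replace_dashes(text: str) -> str:
    """Replace every em or en dash with a plain hyphen."""
    return text.replace(EM_DASH, "-").replace(EN_DASH, "-")

sample = "2000;69:531\u2013569\n3.2-fold"
hits = find_dashes(sample)
cleaned = replace_dashes(sample)
```

An empty result from `find_dashes` on the full article would confirm the conclusion above that no changes are needed.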
Fourth rule: Preserve Content - do not delete any existing content, code, or Mermaid diagrams. There are no Mermaid diagrams in the text. We must not delete anything, just format.
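One way to check the "just format, delete nothing" constraint for the linking step is to strip markdown link syntax from both versions and compare what remains. This is a sketch under the assumption that the only intended change is wrapping existing text in `[text](path)` links; it does not cover the reference renumbering, and `strip_links` and `content_preserved` are hypothetical names.

```python
import re

# Matches [text](target) and captures the visible text.
LINK = re.compile(r"\[([^\]]+)\]\([^)\s]+\)")

def strip_links(markdown: str) -> str:
    """Reduce every markdown link to its visible text."""
    return LINK.sub(r"\1", markdown)

def content_preserved(original: str, edited: str) -> bool:
    """True if the two versions differ only by markdown link wrappers."""
    return strip_links(edited) == strip_links(original)

original = "intranuclear inclusion bodies from Canine Herpesvirus 1 were seen"
edited = (
    "intranuclear inclusion bodies from "
    "[Canine Herpesvirus 1](/knowledge/viruses/pet-viruses/canine-herpesvirus)"
    " were seen"
)
ok = content_preserved(original, edited)
```

Bracketed citations such as [10] are left alone by the pattern because they are not followed by a parenthesised target.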
Now, let's implement the changes:
References: Convert to numbered list. The current format has each reference on a new line with an HTML anchor tag before the text. We'll change to "X. [text]" and keep the anchor for linking. The anchor is the target of the in-text citation links, so we need to preserve that functionality: each citation in the body is a link whose visible text is the number (for example 10) and whose target is the matching HTML anchor. If we change the reference list to have numbers, we can still keep the anchor inside the list item, for example "10. [text]" with the anchor left in front of the text. The number then sits outside the anchor. Alternatively, we could make the number itself the anchor, but the in-text citations use the number as the link text, so to maintain compatibility we keep the anchor tags as they are and just prepend the number. Each reference will become "1. [text]", which retains the anchor ID, and the citations [1], [2], etc. in the body will still point to the correct anchors. The reference list has anchors with IDs like ref-10 and the body text links to #ref-10, which is fine. Do we need to renumber? No: the references are already numbered from 1 to 60. The list is not in order of appearance; it is in the order the references were defined (ref-1 first, then ref-2, etc.). The body text uses numbers like [10], which jumps to ref-10. That's okay. We don't need to reorder the references; we just need to ensure they are in a numbered list. So we keep the existing order and add "1. " before the first reference, "2. " before the second, and so on, numbering them sequentially from 1 to 60. The body citations use numbers that correspond to those IDs: [10] in the body refers to ref-10, which in the new numbering will be the 10th item in the list (since ref-10 is the 10th reference). So that's consistent, and we just add the number and period before each reference entry.
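The numbering step can be sketched as follows. The exact anchor markup is not visible in these notes, so the `<a id="ref-N"></a>` form used here is an assumption, as is the helper name `number_references`. The sketch prepends the list number, keeps the anchor untouched, and raises an error if an anchor ID disagrees with its position, since that would break the [N] citations in the body.

```python
import re

# Assumed anchor form; the real markup may differ.
ANCHOR = re.compile(r'^<a id="ref-(\d+)"></a>')

def number_references(lines: list[str]) -> list[str]:
    """Prefix each reference line with its 1-based position and a period."""
    numbered = []
    for position, line in enumerate(lines, start=1):
        match = ANCHOR.match(line)
        if match and int(match.group(1)) != position:
            raise ValueError(
                f"anchor ref-{match.group(1)} is at list position {position}"
            )
        numbered.append(f"{position}. {line}")
    return numbered

refs = [
    '<a id="ref-1"></a>First reference text.',
    '<a id="ref-2"></a>Second reference text.',
]
numbered = number_references(refs)
```

Failing loudly on a mismatch is a design choice: silently renumbering would leave body citations pointing at the wrong entries.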
Virus linking: We need to scan the body text for first occurrences of each virus from the registry. The registry is extensive. We'll go through the text section by section.
Let's start from the beginning:
Frontmatter: no viruses.
First section: "## Overview and Principles..." First mention of a virus: "Avian Influenza Virus" appears in paragraph 2: "the presence or absence of Avian Influenza Virus on a farm". That's the first occurrence. We link it: Avian Influenza Virus
Next: "Porcine Reproductive and Respiratory Syndrome Virus" appears in same paragraph: "classification of a Porcine Reproductive and Respiratory Syndrome Virus outbreak". Link: Porcine Reproductive and Respiratory Syndrome Virus
Next: "Bovine Leukemia Virus" appears later in same paragraph: "predict Bovine Leukemia Virus seropositivity". Link: Bovine Leukemia Virus
Then in same paragraph: "PRRSV" - that's an abbreviation, not the full name. The rule says "specific animal viruses" - we should link the full name at first occurrence. PRRSV is abbreviation, but we already linked the full name. So subsequent uses of PRRSV are fine.
Next: "Influenza A Virus in Cats" appears later: "novel reassortant strains of Influenza A Virus in Cats". In registry, we have "Feline Influenza A Virus" but not "Influenza A Virus in Cats". The exact name in registry is "Feline Influenza A Virus" with link /knowledge/viruses/pet-viruses/feline-influenza-a-virus. The text says "Influenza A Virus in Cats". Should we link that to Feline Influenza A Virus? The rule says "all mentions of specific animal viruses" - we should match the registry name exactly. The registry has "Feline Influenza A Virus", not "Influenza A Virus in Cats". However, the text is referring to the same virus. To be safe, we should link the exact registry name if it appears. But "Influenza A Virus in Cats" is not in the registry. There is also "Avian Influenza Virus" but that's different. So we might not link that. But there is "Canine Influenza A Virus" etc. The text says "Influenza A Virus in Cats" - that could be considered a specific animal virus, but it's not in the registry. The registry includes "Feline Influenza A Virus" which is the same. I think we should link it to the feline entry if it's the first occurrence of that virus. However, the rule says "all mentions of specific animal viruses in the body text are hyperlinked to their actual articles using markdown format." The registry is the list of articles we have. So if the text uses a different name, we should still link to the relevant article if it's clearly the same virus. But the instruction says "Here is the registry of viruses and their links" - we are to use those exact links for the virus names as they appear. The virus names in the registry are specific. So we need to match the text with the registry. "Influenza A Virus in Cats" does not match "Feline Influenza A Virus". There is also "Feline Influenza A Virus" appearing later? Let's see later: there is "Feline Influenza A Virus" not mentioned. Actually, later in the text there is "Influenza A Virus in Cats" again. 
But we don't have a link for that exact phrase. I think we should not link it, because the registry doesn't have that exact phrase. However, we could be flexible: the article is about veterinary diagnostics, and "Influenza A Virus in Cats" is essentially feline influenza. But to strictly follow the rule, we should only link exact matches from the registry. The registry includes "Canine Influenza A Virus", "Equine Influenza A Virus", "Feline Influenza A Virus", etc. The text says "Influenza A Virus in Cats" - that is not exactly "Feline Influenza A Virus". But later in the text, there is "Feline Influenza A Virus" mentioned? Let's scroll: In the section "Molecular Pathogenesis..." there is "Feline Influenza A Virus" not present. Actually, in the "Multimodal Data Fusion" section, there is "Feline Calicivirus" and "Feline Coronavirus and FIP". No "Feline Influenza A Virus". So "Influenza A Virus in Cats" appears only in that one place? Actually, it appears in the "Overview and Principles" section: "novel reassortant strains of Influenza A Virus in Cats". Then later in "The Data Ecosystem" paragraph, there is "Influenza A Virus" without "in Cats". That's a general mention, but we already linked "Avian Influenza Virus" earlier. So we have multiple influenza viruses. We need to be careful.
Better approach: Go through the text listing all virus names from the registry that appear exactly. I'll do a systematic scan.
List of virus names from registry that appear in the body text (excluding references):
- Avian Influenza Virus (appears many times, first in Overview paragraph)
- Porcine Reproductive and Respiratory Syndrome Virus (first in Overview)
- Bovine Leukemia Virus (first in Overview)
- Infectious Bursal Disease Virus (first in Overview, paragraph about unsupervised learning)
- African Swine Fever Virus (first in Data Ecosystem, genomic data)
- Rabies Lyssavirus (first in Data Ecosystem)
- Bluetongue Virus (first in Data Ecosystem)
- African Horse Sickness Virus (first in Data Ecosystem)
- Nipah Virus in Pigs (first in Data Ecosystem, under feature engineering)
- Lumpy Skin Disease Virus (first in representation learning)
- Equine Influenza A Virus (first in representation learning)
- Dengue virus? In the body text, "Dengue Virus" appears in the "Temporal and Spatial Fusion" section, and "Dengue cases" and "Dengue virus" also show up in the references. But the registry does not have "Dengue Virus". There is "Tembusu Virus", but that is a different virus. So no link.
- Canine Herpesvirus 1? Actually, "Canine Herpesvirus" appears in "The Role of Veterinary Clinical Pathology" section: "intranuclear inclusion bodies from Canine Herpesvirus 1". The registry has "Canine Herpesvirus" and "Canine Herpesvirus 1"? Actually, registry has "Canine Herpesvirus" with link /knowledge/viruses/pet-viruses/canine-herpesvirus. So we link that.
- Canine Distemper Virus appears? In "Clinical Application" section? Actually, "Canine Distemper Virus" is mentioned? Let's check: In "Biological Insights from Fused Data Streams" there is "Canine Distemper Virus". Yes, near the end: "Canine Distemper Virus". So link.
- Feline Calicivirus appears in "Biological Insights" section. Link.
- Bovine Respiratory Syncytial Virus appears in same section. Link.
- Feline Coronavirus and FIP appears later. Link.
- Infectious Salmon Anemia Virus appears in "Emerging Frontiers". Link.
- White Spot Syndrome Virus appears. Link.
- Koi Herpesvirus appears. Link.
- Foot-and-Mouth Disease Virus appears many times. First in "Performance in High-Stakes" section? Actually, first appears in "Performance in High-Stakes" - "Foot-and-Mouth Disease Virus" is mentioned. Also earlier in "Molecular Determinants" there is "Foot-and-Mouth Disease Virus". Actually, the first occurrence is in "Model Selection, Training, and Validation" paragraph? Let's trace: In "Interpretability and the Path to Clinical Translation" paragraph, there is "Foot-and-Mouth Disease Virus". That is the first? Actually, in "Feature Engineering and Representation Learning" there is "Foot-and-Mouth Disease Virus"? No. In "The Foundational Paradigms" there is "Foot-and-Mouth Disease Virus"? Not until later. Let's check the order: The first mention of FMDV is in "Interpretability and the Path to Clinical Translation" - "sheep density is the dominant predictor for Foot-and-Mouth Disease Virus serotype O". That is the first. So link there.
- Classical Swine Fever Virus appears in "Real-Time Pathogen Genotyping" - "Classical Swine Fever Virus". Link.
- Porcine Epidemic Diarrhea Virus appears in "From Sequence to Surveillance" - "Porcine Epidemic Diarrhea Virus". Link.
- West Nile Virus in Birds? Actually, "West Nile Virus in Birds" appears in "The Role of Veterinary Clinical Pathology" - "West Nile Virus in Birds". The registry has "West Nile Virus in Wild Birds" with link /knowledge/viruses/wildlife-viruses/west-nile-virus-in-wild-birds. The text says "West Nile Virus in Birds". That's slightly different. We have "West Nile Virus" also in registry as /knowledge/viruses/avian-viruses/west-nile-virus. But "West Nile Virus in Horses" is separate. The exact phrase "West Nile Virus in Birds" is not in registry. However, "West Nile Virus in Wild Birds" is. We could link to that as it's the closest. But to be precise, we should only link if exact match. The text also has "West Nile Virus in Horses" later? It says "West Nile Virus in Horses" - that is in registry as /knowledge/viruses/livestock-viruses/west-nile-virus-in-horses. So link that if it appears. It appears in the same section: "West Nile Virus in Horses". So link.
- Avian Influenza In Wild Birds? The text does not use that phrase; the closest is "highly pathogenic avian influenza (HPAI) clade 2.3.4.4b", which is not a registry name. So no link.
- "Hendra virus" appears? Not in registry. "SARS-CoV-2" not in registry.
- "Grass Carp Reovirus" - not in registry? Registry has "Piscine Orthoreovirus" etc. But "Grass Carp Reovirus" not. So don't link.
- "Duck Tembusu Virus"? Not.
- "Canine Parvovirus" appears in "Cross-Modal Attention" section. Link.
- "Feline Immunodeficiency Virus"? Not.
- "Feline Leukemia Virus"? Not.
- "Equine Influenza A Virus" already linked earlier.
- "Bovine Coronavirus" appears? In "Sensor Fusion" section, "Bovine Coronavirus" is mentioned. Link.
- "Bovine Herpesvirus 1"? The sentence actually says "Bovine Coronavirus and Mannheimia haemolytica", and Mannheimia haemolytica is a bacterium, not a virus. The text says "Mannheimia haemolytica" but the existing link points to bovine herpesvirus-1, which may be an error in the original article. We are only linking viruses, not bacteria, so ignore Mannheimia. The text does say "Bovine Coronavirus", so we link that.
- "Rabies Lyssavirus" already linked.
- "Canine Adenovirus"? Not appearing.
- "Feline Herpesvirus 1"? Not.
- "Feline Panleukopenia Virus"? Not.
- "Porcine Circovirus 2"? Not.
- "Newcastle Disease Virus" appears in "Diagnostic Integration" section. Link.
- "Avian Leukosis Virus"? Not.
- "Infectious Laryngotracheitis Virus"? Not.
- "Mareks Disease Virus"? Not.
- "Infectious Bronchitis Virus"? Not.
- "Turkey Coronavirus"? Not.
- "Avian Metapneumovirus"? Not.
Also need to link "Eastern Equine Encephalitis Virus"? Not appearing.
We also have "Tembusu Virus" not appearing.
Now, we need to find first occurrence of each. Since the article is long, we'll assume we can find them in order. We'll process the text line by line.
To save time, we can do a text search in our mind for the virus names from the registry. I'll list all that appear:
- Avian Influenza Virus (first in Overview)
- Porcine Reproductive and Respiratory Syndrome Virus (first in Overview)
- Bovine Leukemia Virus (first in Overview)
- Infectious Bursal Disease Virus (first in Overview)
- African Swine Fever Virus (first in Data Ecosystem)
- Rabies Lyssavirus (first in Data Ecosystem)
- Bluetongue Virus (first in Data Ecosystem)
- African Horse Sickness Virus (first in Data Ecosystem)
- Nipah Virus in Pigs (first in Feature Engineering)
- Lumpy Skin Disease Virus (first in Representation Learning)
- Equine Influenza A Virus (first in Representation Learning)
- Foot-and-Mouth Disease Virus (first in Interpretability section)
- Classical Swine Fever Virus (first in Real-Time Pathogen Genotyping)
- Porcine Epidemic Diarrhea Virus (first in From Sequence to Surveillance)
- Canine Herpesvirus 1 (first in Role of Vet Clinical Pathology) - note registry has "Canine Herpesvirus" without 1, but the text says "Canine Herpesvirus 1" which is a specific strain? Actually, registry has "Canine Herpesvirus" only. But there is "Canine Herpesvirus" entry, so we can link that. The text says "Canine Herpesvirus 1" - we'll link to "Canine Herpesvirus" since that's the closest.
- Canine Distemper Virus (first in Biological Insights)
- Feline Calicivirus (first in Biological Insights)
- Bovine Respiratory Syncytial Virus (first in Biological Insights)
- Feline Coronavirus and FIP (first in Biological Insights) - registry has "Feline Coronavirus And Fip" but with "and" capitalized? The text says "Feline Coronavirus and FIP". The registry link is /knowledge/viruses/pet-viruses/feline-coronavirus-and-fip. We'll link that.
- Infectious Salmon Anemia Virus (first in Emerging Frontiers)
- White Spot Syndrome Virus (first in Emerging Frontiers)
- Koi Herpesvirus (first in Emerging Frontiers)
- West Nile Virus in Horses (first in Role of Vet Clinical Pathology) - but we also have "West Nile Virus in Birds" earlier? Let's check: In the same paragraph, it says "West Nile Virus in Birds" then "West Nile Virus in Horses". So first occurrence of West Nile Virus in Birds is before Horses. But "West Nile Virus in Birds" is not exactly "West Nile Virus in Wild Birds". However, the registry has "West Nile Virus" (avian) and "West Nile Virus in Wild Birds". The text says "West Nile Virus in Birds". We could link to "West Nile Virus in Wild Birds" if we consider it the same. But to be safe, we'll link only exact matches. Since "West Nile Virus in Wild Birds" is in registry, and the text says "West Nile Virus in Birds", it's close but not exact. The text also says "West Nile Virus in Horses" exact match. So we link that.
- Newcastle Disease Virus (first in Diagnostic Integration)
- Bovine Coronavirus (first in Sensor Fusion)
- Also "Avian Influenza In Wild Birds" appears? Not.
- "Porcine Reproductive and Respiratory Syndrome Virus" already.
- "Avian Influenza Virus" already.
Also "Dengue Virus" not in registry. "Grass Carp Reovirus" not. "Sendai Virus"? Not. "Mouse Hepatitis Virus"? Not.
What about "Influenza A virus" without species? In the text, there is "Influenza A viruses" in multiple places. That is not a specific animal virus, but a genus. So no link.
Also "H5N1" is not a virus name but a subtype. Not in registry.
We also have "Marburg virus"? Not.
Let's go through the text section by section to ensure we catch all first occurrences.
First section "Overview and Principles":
- Avian Influenza Virus: first in paragraph 2.
- Porcine Reproductive and Respiratory Syndrome Virus: same paragraph.
- Bovine Leukemia Virus: later in same paragraph.
- Infectious Bursal Disease Virus: in unsupervised learning paragraph.
- African Swine Fever Virus: in Data Ecosystem.
- Bluetongue Virus: in Data Ecosystem.
- African Horse Sickness Virus: in Data Ecosystem.
- Rabies Lyssavirus: in Data Ecosystem.
- Nipah Virus in Pigs: in Feature Engineering.
- Lumpy Skin Disease Virus: in Representation Learning.
- Equine Influenza A Virus: in Representation Learning.
- Foot-and-Mouth Disease Virus: in Interpretability section.
Now second section "Pathogen Surveillance and Genomic Data Integration":
Already linked viruses: Avian Influenza, Porcine Reproductive and Respiratory Syndrome, Bovine Leukemia, African Swine Fever, Rabies Lyssavirus, Bluetongue, African Horse Sickness, Nipah in Pigs, Lumpy Skin Disease, Equine Influenza A, Foot-and-Mouth.
New: Classical Swine Fever Virus appears in Real-Time Pathogen Genotyping: "Classical Swine Fever Virus". First occurrence.
Also "Influenza A viruses" general.
"Porcine Epidemic Diarrhea Virus" appears in From Sequence to Surveillance: "Porcine Epidemic Diarrhea Virus". First occurrence.
"Avian Influenza Virus" already linked.
"West Nile Virus in Birds" appears? In this section, there is "West Nile Virus in Birds" in The Role of Veterinary Clinical Pathology? Actually, that's in the next section? Wait, the second section ends with "The Role of Veterinary Clinical Pathology" which is part of Pathogen Surveillance? Yes, that subsection is within section 2. So "West Nile Virus in Birds" appears there. Also "West Nile Virus in Horses" appears later in the same paragraph. So first occurrence of West Nile virus in birds is in that paragraph. We'll link it to "West Nile Virus in Wild Birds" as close match, but since the exact phrase is not in registry, we might skip? Let's check registry: It has "West Nile Virus" under avian-viruses/west-nile-virus, and also "West Nile Virus in Wild Birds" under wildlife-viruses. The text says "West Nile Virus in Birds". That could be interpreted as the same as "West Nile Virus in Wild Birds". But to be consistent, we should only link if the text matches exactly. Since the text says "West Nile Virus in Birds" and the registry has "West Nile Virus in Wild Birds", it's not exact. However, we do have "West Nile Virus" as a separate entry. The text later says "West Nile Virus in Horses" exact match. So for "West Nile Virus in Birds", we could link to "West Nile Virus" (avian) or to "West Nile Virus in Wild Birds". I think it's safer to link to the avian West Nile Virus page because it's about birds. The registry has "West Nile Virus" under avian-viruses. So we can link that. But the rule says "all mentions of specific animal viruses" - "West Nile Virus" is a specific virus. So we should link the first occurrence of West Nile Virus as a virus name. The first occurrence is "West Nile Virus in Birds" - so we link "West Nile Virus" to its article. However, we have to be careful: The virus name is "West Nile Virus". The text says "West Nile Virus in Birds", but the virus itself is West Nile Virus. 
So we can link the words "West Nile Virus" in that phrase to /knowledge/viruses/avian-viruses/west-nile-virus. That would be appropriate. Then later when "West Nile Virus in Horses" appears, we link "West Nile Virus" again? The rule says only link the first occurrence of each virus. So if we link "West Nile Virus" at the first occurrence, then later occurrences of "West Nile Virus in Horses" should not be linked because the virus "West Nile Virus" has already been linked. However, "West Nile Virus in Horses" is a separate entry in the registry with a different path. But it's the same virus, just in horses. The rule is ambiguous. The registry has separate entries for "West Nile Virus" and "West Nile Virus in Horses". The text mentions "West Nile Virus in Horses" - that is a specific animal virus reference. It should probably be linked to the horses page. But if we already linked "West Nile Virus" earlier, we might not link it again. To be safe, we treat each distinct registry entry as a separate virus? The instruction says "all mentions of specific animal viruses" and provides a list of virus-link pairs. So we should link each occurrence of a phrase that matches a registry entry exactly. "West Nile Virus in Horses" matches a registry entry, so its first occurrence should be linked. Similarly, "West Nile Virus in Birds" does not match any entry exactly, but "West Nile Virus" does. So we can link "West Nile Virus" in that phrase. Then later, "West Nile Virus in Horses" has a different registry entry, so we link that separately. But note that "West Nile Virus" and "West Nile Virus in Horses" are different strings. So we link each at its first occurrence. That seems reasonable. Also there is "West Nile Virus" in other contexts? In the text, after "West Nile Virus in Birds" and then "West Nile Virus in Horses", there is no other bare "West Nile Virus". So we're fine.
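The overlap between "West Nile Virus" and "West Nile Virus in Horses" is the tricky part of treating each registry entry as its own string. A minimal sketch of one way to handle it: build a single alternation with longer names first, so the longer entry wins where both could match, and link each entry only the first time it is seen. The two paths are the ones quoted in these notes; `link_registry` is a hypothetical name and the two-entry registry is only an excerpt.

```python
import re

def link_registry(body: str, registry: dict[str, str]) -> str:
    """Link the first occurrence of every registry name, longest match first."""
    names = sorted(registry, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    seen: set[str] = set()

    def repl(match: re.Match) -> str:
        name = match.group(0)
        if name in seen:
            return name  # later occurrences stay as plain text
        seen.add(name)
        return f"[{name}]({registry[name]})"

    # One pass over the text, so inserted links are never re-scanned.
    return pattern.sub(repl, body)

registry = {
    "West Nile Virus": "/knowledge/viruses/avian-viruses/west-nile-virus",
    "West Nile Virus in Horses": "/knowledge/viruses/livestock-viruses/west-nile-virus-in-horses",
}
body = (
    "West Nile Virus in Birds and West Nile Virus in Horses; "
    "West Nile Virus in Horses again."
)
linked = link_registry(body, registry)
```

With this ordering, "West Nile Virus in Birds" gets the avian link on its first three words, and the horses entry is linked separately at its own first occurrence.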
Also in that subsection, "Canine Herpesvirus 1" appears: "Canine Herpesvirus 1". Registry has "Canine Herpesvirus" (without 1). So we link "Canine Herpesvirus" but the text includes "1". We can link only the part "Canine Herpesvirus" to its entry, but the rule says "all mentions of specific animal viruses" - the name is "Canine Herpesvirus 1". There is no entry for "Canine Herpesvirus 1" in registry, only "Canine Herpesvirus". So we should link "Canine Herpesvirus" if it appears, but the text says "Canine Herpesvirus 1". I think we can link the whole phrase "Canine Herpesvirus 1" to the same link as "Canine Herpesvirus" since it's the same virus. Or we can link only "Canine Herpesvirus" leaving "1" outside. The rule doesn't specify. Simpler: we can treat "Canine Herpesvirus 1" as a mention of canine herpesvirus, so we link the words "Canine Herpesvirus" to its article. But the text says "Canine Herpesvirus 1" as a compound. I'll link the entire phrase "Canine Herpesvirus 1" to the canine herpesvirus link. The registry doesn't have that exact string, but we can use the link for "Canine Herpesvirus". I think that's acceptable.
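Near-matches such as "Canine Herpesvirus 1" can be made explicit with a small alias table rather than decided case by case. This is a sketch only: the alias entry reflects the judgment call made above, the path is the one quoted in these notes, and `resolve` is a hypothetical helper. Mentions with neither a registry entry nor an alias resolve to `None` and stay unlinked.

```python
from typing import Optional

REGISTRY = {
    "Canine Herpesvirus": "/knowledge/viruses/pet-viruses/canine-herpesvirus",
}

# Text variant -> registry name; each entry is an editorial decision.
ALIASES = {
    "Canine Herpesvirus 1": "Canine Herpesvirus",
}

def resolve(mention: str) -> Optional[str]:
    """Return the article path for a mention, or None if it should not be linked."""
    canonical = ALIASES.get(mention, mention)
    return REGISTRY.get(canonical)

path = resolve("Canine Herpesvirus 1")
unlinked = resolve("Dengue virus")
```

Keeping the aliases in one table makes it easy to review which non-exact matches were accepted.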
Also "Canine Distemper Virus" appears later in Biological Insights section. That's a separate virus.
Also "Feline Calicivirus" appears in Biological Insights.
Also "Bovine Respiratory Syncytial Virus" appears in Biological Insights.
Also "Feline Coronavirus and FIP" appears in Biological Insights.
Also "Infectious Salmon Anemia Virus" appears in Emerging Frontiers.
Also "White Spot Syndrome Virus" appears in Emerging Frontiers.
Also "Koi Herpesvirus" appears in Emerging Frontiers.
Also "Newcastle Disease Virus" appears in Diagnostic Integration section.
Also "Bovine Coronavirus" appears in Sensor Fusion section.
Also "Avian Influenza Virus" already linked.
Also "Porcine Reproductive and Respiratory Syndrome Virus" already.
Also "Lumpy Skin Disease Virus" already.
Also "Equine Influenza A Virus" already.
Also "Foot-and-Mouth Disease Virus" already.
Also "Classical Swine Fever Virus" already.
Also "Porcine Epidemic Diarrhea Virus" already.
Also "Rabies Lyssavirus" already.
Also "Bluetongue Virus" already.
Also "African Horse Sickness Virus" already.
Also "African Swine Fever Virus" already.
Also "Nipah Virus in Pigs" already.
Also "Bovine Leukemia Virus" already.
Also "Infectious Bursal Disease Virus" already.
Now, we also need to check for "Porcine Reproductive and Respiratory Syndrome" abbreviated as PRRSV - we already linked full name, so no need.
Also "Avian Influenza" as shorthand - already linked.
Now, we proceed to edit the article.
We'll go through the entire body text and add markdown links for the first occurrence of each virus from the registry. We'll keep track using a list.
Let's create a list of first occurrences in order of appearance in the text (approximate):
- Avian Influenza Virus (Overview paragraph 2)
- Porcine Reproductive and Respiratory Syndrome Virus (same paragraph)
- Bovine Leukemia Virus (same paragraph later)
- Infectious Bursal Disease Virus (Overview, unsupervised learning paragraph)
- African Swine Fever Virus (Data Ecosystem)
- Rabies Lyssavirus (Data Ecosystem)
- Bluetongue Virus (Data Ecosystem)
- African Horse Sickness Virus (Data Ecosystem)
- Nipah Virus in Pigs (Feature Engineering)
- Lumpy Skin Disease Virus (Representation Learning)
- Equine Influenza A Virus (Representation Learning)
- Foot-and-Mouth Disease Virus (Interpretability section)
- Classical Swine Fever Virus (Real-Time Pathogen Genotyping)
- Porcine Epidemic Diarrhea Virus (From Sequence to Surveillance)
- West Nile Virus in Birds? Actually, in "The Role of Veterinary Clinical Pathology" within section 2, there is "West Nile Virus in Birds" - but we can link "West Nile Virus" as first occurrence. The phrase "West Nile Virus" appears as part of "West Nile Virus in Birds". So we'll link "West Nile Virus" to the avian page. Then later "West Nile Virus in Horses" appears, we'll link that separately as a different entry.
- West Nile Virus in Horses (same paragraph)
- Canine Herpesvirus 1 (same paragraph) - we'll link to Canine Herpesvirus.
- Canine Distemper Virus (Biological Insights)
- Feline Calicivirus (Biological Insights)
- Bovine Respiratory Syncytial Virus (Biological Insights)
- Feline Coronavirus and FIP (Biological Insights)
- Infectious Salmon Anemia Virus (Emerging Frontiers)
- White Spot Syndrome Virus (Emerging Frontiers)
- Koi Herpesvirus (Emerging Frontiers)
- Newcastle Disease Virus (Diagnostic Integration)
- Bovine Coronavirus (Sensor Fusion)
Also check if "Porcine Reproductive and Respiratory Syndrome Virus" appears earlier than the Overview? No.
Also "Avian Influenza Virus" appears many times, only first linked.
Also note that "Influenza A virus" generic not linked.
Now we need to be careful with the exact text. For some, the text may include additional words like "the" or "a". We only link the virus name itself.
We'll now write the corrected markdown. We'll preserve all content, just add links and reformat references.
Let's start with the frontmatter. No changes.
Then the body text. We'll go through paragraph by paragraph and add links. Since the text is long, we'll make the modifications in the output.
Also ensure no deletion. We'll also check for em dashes. I'll quickly scan for any en dash in the body text: In the "Data Ecosystem" paragraph, there is "2001-2003"? Actually not. In "Feature Engineering" there is "H3Nx" with no dash. In "Model Selection", there is "AUROC of 0.95" etc. No en dashes. In the references, many DOI links have hyphens, but there might be en dashes in page ranges? I see "pp13-28" in reference [52]? Actually, reference [52] says "pp13-28" with hyphen. But in the original provided text, it's "pp13-28" - no en dash. In reference [59], "pp. 1-15" probably. I think all are hyphens. However, in the references there is one: "2000;69:531-569" appears in the examples but not in the actual references provided. The actual references have dates like "2025" etc. So no en dashes visible.
Thus, only changes are virus linking and references numbering.
Let's now produce the final output.
---
title: "Machine Learning Algorithms for Predicting Veterinary Viral Outbreaks"
primary_keyword: "Mach"
lastUpdated: "2026-05-26"
category: "emerging-tech"
metaDescription: "An in-depth, peer-reviewed academic analysis of Machine Learning Algorithms for Predicting Veterinary Viral Outbreaks in veterinary diagnostics."
---
Machine Learning Algorithms for Predicting Veterinary Viral Outbreaks
Overview and Principles of Machine Learning Algorithms for Predicting Veterinary Viral Outbreaks
The application of machine learning (ML) to veterinary virology represents a paradigm shift from reactive, post-hoc outbreak management to proactive, data-driven prediction and prevention. As a veterinary clinical pathologist, I must emphasize that the fundamental principle underpinning this transition is the capacity of ML algorithms to discern complex, non-linear, and often cryptic patterns within high-dimensional datasets, patterns that are frequently imperceptible to conventional statistical methods or human intuition. The prediction of viral outbreaks in animal populations is not merely an exercise in computational modeling; it is a biological and ecological challenge that requires the integration of virology, immunology, epidemiology, climatology, and animal husbandry. The ML algorithms deployed for this purpose are not monolithic; rather, they constitute a diverse toolkit, each class of algorithm possessing distinct assumptions, strengths, and limitations that must be meticulously matched to the specific biological question and data structure at hand.
The Foundational Paradigms of Machine Learning in Veterinary Virology
At its core, ML for outbreak prediction can be categorized into three overarching paradigms: supervised learning, unsupervised learning, and reinforcement learning, with the first two being of paramount importance in the current context. Supervised learning algorithms are trained on labeled datasets, where the outcome of interest (for instance, the presence or absence of Avian Influenza Virus on a farm, or the classification of a Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) outbreak as high or low impact) is known. The algorithm learns a mapping from input features (e.g., temperature, humidity, stocking density, biosecurity scores, genomic sequences) to the target label. This paradigm is exemplified by studies using random forest (RF) and gradient boosting machines (GBM) to predict Bovine Leukemia Virus seropositivity in dairy cattle, where the RF model achieved an area under the receiver operating characteristic curve (AUROC) of 0.93, identifying age and geographic location as dominant predictors [10]. Similarly, supervised models have been instrumental in predicting the clinical impact of PRRSV outbreaks in sow herds, though with modest accuracy, highlighting the inherent difficulty of linking viral genetic data (ORF-5 sequences) to complex phenotypic outcomes [9, 18].
In contrast, unsupervised learning algorithms operate on unlabeled data, seeking to discover inherent structures, clusters, or anomalies without prior knowledge of outcomes. This is particularly valuable for early warning systems where the "signal" of an emerging outbreak is unknown. For example, unsupervised clustering of spike protein sequences using Levenshtein distance has been employed to identify emerging persistent variants of SARS-CoV-2, serving as an early warning for new waves of infection [22]. In a veterinary context, such approaches could be applied to detect novel reassortant strains of Influenza A Virus in Cats or emerging pathotypes of Infectious Bursal Disease Virus before they cause widespread clinical disease. Reinforcement learning, while less commonly applied in current veterinary outbreak prediction, holds future promise for optimizing dynamic intervention strategies, such as adaptive vaccination schedules or culling protocols, by learning optimal policies through trial-and-error interactions with a simulated environment.
The Data Ecosystem: Sources, Heterogeneity, and Preprocessing Challenges
The efficacy of any ML model is inextricably linked to the quality, breadth, and representativeness of the data on which it is trained. The data ecosystem for veterinary viral outbreak prediction is exceptionally heterogeneous, encompassing:
- Genomic and Sequence Data: Viral genomes (e.g., hemagglutinin sequences for influenza, ORF-5 for PRRSV, whole-genome sequences for African Swine Fever Virus) provide direct evidence of viral evolution, virulence determinants, and host adaptation. Machine learning models have been trained to predict animal hosts from viral genome composition biases, achieving ~73% accuracy for coronaviruses and demonstrating that spike protein sequences are as informative as whole genomes [16]. For influenza H3Nx viruses, a gradient boosting machine classifier achieved 98% accuracy in identifying the most likely host species from hemagglutinin sequences [13].
- Epidemiological and Surveillance Data: This includes case reports, outbreak notifications, mortality records, and diagnostic laboratory submissions. Such data are often subject to significant reporting biases, particularly in low-resource settings. Machine learning models, such as extreme gradient boosting (XGBoost) combined with oversampling techniques, have been used to improve the understanding of Rabies Lyssavirus epidemiology in Haiti, where a large proportion of animal investigations lack laboratory confirmation [11].
- Environmental and Climatic Data: Temperature, precipitation, humidity, vegetation indices (NDVI), and solar radiation are critical drivers of vector-borne viral diseases. For instance, random forest models have been used to map the distribution of Culicoides midges, the vectors for Bluetongue Virus and African Horse Sickness Virus, identifying cattle density and water vapor pressure as key predictors [14]. Similarly, the concept of waterbird activity entropy, derived from the monthly distributions of 779 waterbird species, has demonstrated high explanatory power (AUC = 0.87) for global Avian Influenza Virus risk, particularly for H5N1 [5].
- Host and Management Factors: Data on animal demographics (age, breed, species), farm management practices (biosecurity protocols, vaccination history, stocking density), and animal movement patterns are crucial. A study on PRRSV control in Japanese breeding farms used random forest analysis to prioritize biosecurity practices, revealing that semen management and controlled barn environments were critical for control, while stringent management of replacement gilts was essential for eradication [4].
- Wearable Biosensor and Physiological Data: The advent of precision livestock farming has generated real-time streams of physiological data (heart rate, respiratory rate, body temperature, activity levels). Machine learning algorithms, including support vector machines (SVM), RF, and long short-term memory (LSTM) networks, can detect anomalies in these data streams, potentially predicting disease onset 2-3 days before clinical signs appear [3, 17].
The integration of these disparate data types presents formidable preprocessing challenges. Data must be cleaned, normalized, and imputed to handle missing values. Temporal alignment is critical, as climatic data may be available at daily resolution while outbreak data is aggregated monthly. Spatial heterogeneity must be accounted for, often through the inclusion of geographic coordinates or the use of spatial cross-validation techniques to prevent overfitting due to spatial autocorrelation [7].
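The temporal alignment step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names (`monthly_mean`, `align`), the field names, and the toy readings are all hypothetical and do not come from any cited study. Missing daily readings are simply skipped here rather than imputed, which is an assumption made to keep the sketch short.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(daily_records):
    """Aggregate (date, value) pairs into {(year, month): mean value}."""
    buckets = defaultdict(list)
    for day, value in daily_records:
        if value is None:  # skip missing readings instead of imputing them
            continue
        buckets[(day.year, day.month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def align(monthly_climate, monthly_outbreaks):
    """Join two monthly series on (year, month), keeping shared months only."""
    rows = []
    for key in sorted(monthly_climate.keys() & monthly_outbreaks.keys()):
        rows.append({
            "month": key,
            "mean_temp_c": monthly_climate[key],
            "outbreaks": monthly_outbreaks[key],
        })
    return rows


# Hypothetical toy data: daily temperatures and monthly outbreak counts.
daily_temp = [
    (date(2024, 1, 1), 4.0), (date(2024, 1, 2), None), (date(2024, 1, 3), 6.0),
    (date(2024, 2, 1), 8.0), (date(2024, 2, 2), 10.0),
]
outbreaks = {(2024, 1): 0, (2024, 2): 3}

aligned = align(monthly_mean(daily_temp), outbreaks)
for row in aligned:
    print(row)
```

Aggregating the finer-grained series up to the coarser one, rather than interpolating outbreak counts down to daily values, avoids inventing outcome data that was never observed.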
Feature Engineering and Representation Learning: The Bridge Between Raw Data and Predictive Insight
The raw data described above are rarely in a form directly amenable to ML algorithms. Feature engineering is the process of transforming raw data into informative, non-redundant, and discriminative features that capture the underlying biological signal. For genomic data, this might involve calculating dinucleotide frequencies, codon usage biases, or the presence of specific amino acid motifs in receptor-binding domains. For example, the prediction of animal hosts for coronaviruses relied on compositional biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases [16]. For environmental data, features might include moving averages of temperature, cumulative rainfall over a defined period, or derived indices like the Oceanic Niño Index, which was identified as a key predictor of food shortages for flying foxes and subsequent Nipah Virus in Pigs spillover risk [6].
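As an illustration of the genomic features mentioned above, the following sketch computes dinucleotide frequencies and an observed/expected CpG ratio for a nucleotide string. It uses only the standard library; the function names and the toy sequence are hypothetical, and the cited studies may have computed their compositional features differently.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_frequencies(seq):
    """Return the relative frequency of each of the 16 dinucleotides."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Ignore pairs containing ambiguous bases such as N.
    pairs = [p for p in pairs if set(p) <= set(BASES)]
    counts = Counter(pairs)
    total = sum(counts.values())
    keys = ["".join(p) for p in product(BASES, repeat=2)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G))."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        return 0.0
    return dinucleotide_frequencies(seq)["CG"] / (f_c * f_g)


toy = "ATGCGCGTTAACGGA"  # hypothetical sequence, not a real viral genome
features = dinucleotide_frequencies(toy)
print(round(features["CG"], 4), round(cpg_obs_exp(toy), 3))
```

The resulting 16-element vector (optionally extended with the CpG ratio) is the kind of fixed-length input a tree-based classifier can consume regardless of genome length.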
Representation learning, particularly through deep learning, offers an alternative to manual feature engineering by automatically learning hierarchical representations from raw data. Convolutional neural networks (CNNs) can learn spatial features from diagnostic images, such as chest radiographs for Equine Influenza A Virus or skin lesions for Lumpy Skin Disease Virus, achieving accuracies exceeding 99% in some studies [12]. Recurrent neural networks (RNNs) and LSTM networks are adept at learning temporal dependencies in time-series data, such as the progression of daily mortality rates or the lagged relationship between climatic variables and Dengue virus incidence [19]. More recently, transformer-based architectures, originally developed for natural language processing, are being adapted for genomic sequence analysis, treating nucleotide sequences as a language and learning long-range dependencies that are critical for understanding viral evolution and host adaptation [1].
Model Selection, Training, and Validation: Navigating the Veterinary Context
The choice of ML algorithm is dictated by the nature of the prediction task (classification, regression, clustering), the size and dimensionality of the data, the interpretability requirements, and the computational resources available. Tree-based ensemble methods, such as Random Forest and Gradient Boosting Machines (XGBoost, LightGBM, CatBoost), have emerged as workhorses in veterinary outbreak prediction due to their robustness to non-linear relationships, ability to handle mixed data types, and inherent feature importance metrics. XGBoost, for instance, has been successfully applied to predict Avian Influenza Virus outbreaks in laying farms, achieving 84.62% accuracy and enabling the use of SHAP (SHapley Additive exPlanations) values to interpret the contribution of management factors like vaccination and disinfection [8]. LightGBM demonstrated superior performance in predicting pathogen prevalence in seafood, with 97.6% specificity and 99.9% positive predictive value [2]. CatBoost excelled in classifying the etiology of bacterial meningitis, achieving an AUROC of 0.95 for binary classification [20].
Support Vector Machines (SVM) are powerful for high-dimensional spaces and are particularly effective when the number of features exceeds the number of samples, a common scenario in genomic prediction. SVM has been used for virus host prediction from metagenomic features, with accuracy improved from 80% to 84% through grid search and cross-validation optimization [21]. Deep learning models, including CNNs and LSTM networks, are preferred when dealing with raw image, signal, or sequence data, but they require large, well-annotated datasets and substantial computational power, which can be a limiting factor in many veterinary settings.
A critical and often underappreciated aspect of ML in veterinary virology is model validation. Standard k-fold cross-validation can be overly optimistic when data are temporally or spatially correlated. For outbreak prediction, temporal cross-validation (training on past data, testing on future data) is essential to evaluate the model's true predictive performance in an operational setting [7]. Spatial cross-validation (e.g., leaving out entire geographic regions) is necessary to assess the model's ability to generalize to new areas. Furthermore, the class imbalance problem, in which outbreaks are rare events compared with non-outbreak periods, must be addressed through techniques such as oversampling (e.g., SMOTE), undersampling, or the use of cost-sensitive learning algorithms [11, 15].
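The two validation ideas above can be sketched without any ML library. The first function yields expanding-window temporal splits, so every test period lies strictly after its training periods. The second performs plain random oversampling of the minority class; this is a deliberately simpler stand-in for SMOTE, which synthesizes new points rather than duplicating existing ones. All names and parameters here are hypothetical.

```python
import random


def expanding_window_splits(n_periods, n_test, min_train):
    """Yield (train_idx, test_idx) with training strictly before testing."""
    start = min_train
    while start + n_test <= n_periods:
        yield list(range(0, start)), list(range(start, start + n_test))
        start += n_test


def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until both classes are equal."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to duplicate
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Twelve monthly periods, test on 3 months at a time, at least 6 to train.
splits = list(expanding_window_splits(n_periods=12, n_test=3, min_train=6))
for train_idx, test_idx in splits:
    print(train_idx[-1], test_idx)

# Oversampling should be applied to the training fold only.
X_bal, y_bal = oversample_minority(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(y_bal.count(0), y_bal.count(1))
```

Resampling only inside each training fold matters: if duplicated outbreak rows leak into the test fold, the reported performance will be inflated.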
Interpretability and the Path to Clinical Translation
For a veterinary clinical pathologist, a model that provides a prediction without an explanation is of limited utility. The "black box" nature of many ML algorithms, particularly deep learning, has been a major barrier to clinical adoption. Interpretable machine learning (IML) techniques, such as SHAP and LIME (Local Interpretable Model-agnostic Explanations), are therefore indispensable. SHAP values, grounded in cooperative game theory, provide a unified measure of feature importance, revealing how each predictor contributes to a specific prediction. This has been used to demonstrate that sheep density is the dominant predictor for Foot-and-Mouth Disease Virus serotype O outbreaks in the MENA region, while buffalo density and proximity to roads are more important for serotype A [7]. Similarly, SHAP analysis of a LightGBM model for AIV prediction identified the top five most influential features, including vaccination status and disinfection frequency, providing actionable insights for farm management [8].
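To make the game-theoretic basis of SHAP concrete, the sketch below computes exact Shapley values for a tiny, hypothetical risk function with three features. It is not the SHAP library and does not reproduce any cited model: real SHAP implementations approximate these quantities for trained models using background data, whereas here the "model" is a hand-written function of which features are present, and the feature names and weights are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley value of each feature for a set-valued payoff function."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                coalition = set(subset)
                total += weight * (value_fn(coalition | {f}) - value_fn(coalition))
        phi[f] = total
    return phi


def toy_risk(present):
    """Hypothetical outbreak risk score given the set of 'active' features."""
    score = 0.10  # baseline risk
    if "sheep_density" in present:
        score += 0.30
    if "road_proximity" in present:
        score += 0.10
    if {"sheep_density", "road_proximity"} <= present:
        score += 0.20  # interaction term, shared equally by both features
    return score


names = ["sheep_density", "road_proximity", "buffalo_density"]
phi = shapley_values(toy_risk, names)
for name in names:
    print(name, round(phi[name], 3))
```

The attributions sum to the difference between the full-feature score and the baseline, which is the additivity property that makes SHAP outputs auditable on a per-prediction basis.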
The ultimate goal is to translate these predictive models into clinical decision support systems (CDSS) that can be integrated into veterinary practice and public health surveillance. This requires not only high predictive accuracy but also robustness, generalizability, and user-friendly interfaces. The World Organisation for Animal Health (WOAH) and the Food and Agriculture Organization (FAO) have emphasized the need for such tools to strengthen early warning systems for transboundary animal diseases. The path forward involves close collaboration between veterinary virologists, epidemiologists, data scientists, and field practitioners to ensure that ML models are grounded in biological reality, trained on representative data, and validated under real-world conditions. The principles outlined
Pathogen Surveillance and Genomic Data Integration for Outbreak Prediction
The paradigm of veterinary outbreak prediction has undergone a profound transformation, shifting from retrospective epidemiological analysis to prospective, data-driven forecasting. Central to this evolution is the convergence of systematic pathogen surveillance and high-resolution genomic data, integrated through sophisticated machine learning (ML) and deep learning (DL) architectures. As a veterinary clinical pathologist, I contend that the most robust predictive models are no longer those that simply correlate environmental variables with case counts, but rather those that mechanistically incorporate the evolutionary trajectory, host adaptation, and functional impact of the pathogen itself. This section dissects the architectures, biological underpinnings, and practical integration of genomic surveillance data into predictive frameworks.
Real-Time Pathogen Genotyping and Early Warning Systems
The foundational tenet of genomic integration is the ability to detect and quantify the emergence of novel viral variants before they achieve epidemiological dominance. Traditional Sanger sequencing, while still valuable for phylogenetic inference, lacks the throughput and temporal resolution required for real-time outbreak prediction. This void has been filled by next-generation sequencing (NGS) and, critically, by ML algorithms designed to extract predictive signals from dense genomic datasets. For instance, the application of unsupervised machine learning using Levenshtein distance on spike protein sequences enabled the definition of "persistent variants" and the early warning of emerging variants of concern for SARS-CoV-2, with detection achievable once the associated cluster reached a mere 1% of time-binned sequence data [22]. This methodological approach is directly translatable to veterinary virology. We can envision analogous systems for Avian Influenza Virus, where continuous surveillance of the hemagglutinin (HA) and neuraminidase (NA) genes in both wild and domestic birds could trigger early warnings for the emergence of highly pathogenic strains. The core biological principle here is that viral fitness, dictated by receptor binding avidity, immune escape, and replication efficiency, is imprinted in the genome sequence [26]. By training models on time-stamped sequence data and correlating genetic distance metrics with epidemiological growth rates, we can create a "variant fitness score" that predicts which lineages are likely to cause the next wave.
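A simplified version of the distance-based early warning idea can be sketched as follows. The code computes Levenshtein (edit) distance between sequences and flags any sequence in a time bin that is sufficiently distant from all known variants and has reached a minimum share of that bin (1% by default, echoing the threshold quoted above). This is a threshold rule standing in for the unsupervised clustering used in the cited work; the function names, the `min_dist` parameter, and the toy peptide strings are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def flag_emerging(bin_sequences, known_variants, min_dist=2, min_share=0.01):
    """Flag sequences far from every known variant that reach min_share.

    Assumes known_variants is non-empty.
    """
    counts = Counter(bin_sequences)
    total = len(bin_sequences)
    flagged = []
    for seq, n in counts.items():
        nearest = min(levenshtein(seq, k) for k in known_variants)
        if nearest >= min_dist and n / total >= min_share:
            flagged.append((seq, nearest, n / total))
    return flagged


# Hypothetical one-week bin of 100 short protein fragments.
known = ["MKTLLV"]
week = ["MKTLLV"] * 97 + ["MKTLIV"] * 1 + ["MRTLIA"] * 2
alerts = flag_emerging(week, known)
print(alerts)
```

In practice the distance would be computed on aligned HA, NA, or spike sequences of realistic length, and the flagged set would be tracked across consecutive bins to distinguish persistent variants from sequencing noise.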
The depth of this integration extends beyond simple clustering. Deep mutational scanning (DMS) data, which systematically measures the functional impact of every possible single amino acid mutation on traits like ACE2 binding or antibody escape, provides a mechanistic bridge between genotype and phenotype. As demonstrated in human epidemiology, integrating DMS-derived viral traits with socio-demographic context allowed for the prediction of SARS-CoV-2 lineage fitness with an R² of 0.786 using XGBoost models [26]. For veterinary applications, this is revolutionary. Consider Porcine Reproductive and Respiratory Syndrome Virus (PRRSV), where the ORF-5 sequence is routinely analyzed for prognostic purposes. Current ML models using ORF-5 nucleotide and amino acid sequences, combined with demographic data, have shown only modest accuracy (74.4% for pre-weaning mortality prediction) [9, 18]. However, this accuracy is likely to be significantly enhanced by incorporating a comprehensive DMS-derived landscape of GP5 (the major envelope protein) function, including its role in receptor binding (CD163) and antibody neutralization. The current limitation is not the algorithmic approach, but the poverty of functional annotation for the mutations observed in the field. A concerted effort to generate DMS data for key veterinary pathogens, including Foot-and-Mouth Disease Virus, Classical Swine Fever Virus, and African Swine Fever Virus, would provide the fuel for the next generation of predictive models.
Predictive Modeling of Host Range and Cross-Species Transmission Risk
One of the most critical, yet most challenging, aspects of outbreak prediction is anticipating spillover events from wildlife reservoirs into domestic animals or humans. Genomic data carries an evolutionary signature of host adaptation that can be decoded by ML. The seminal work by Babayan et al. demonstrated that machine learning models trained on genome composition biases, including dinucleotide and codon usage, could predict the reservoir host and the presence and identity of an arthropod vector for single-stranded RNA viruses [28]. This approach leverages the profound coevolutionary pressures that shape viral genomes; a virus replicating in a bat cell faces a different selective landscape (e.g., CpG dinucleotide suppression driven by ZAP proteins) than one replicating in a bird or swine. This principle has been extended and refined. For influenza A viruses, gradient boosting classifiers trained on hemagglutinin (HA) sequences achieved 98.0% accuracy in identifying the most likely host species, from avian to swine to canine [13]. Similarly, an analysis of whole genome and spike protein sequences of coronaviruses using random forest models achieved ~73% accuracy in predicting the animal host (e.g., bat, camelid, carnivore) [16]. The practical implication for surveillance is profound: when a novel virus is sequenced from an environmental sample or a human index case, these models can provide an immediate, data-driven hypothesis for the likely animal origin, directing field investigations and biosecurity measures. This is particularly relevant for high-consequence transboundary pathogens. For example, the prediction of Nipah Virus in Pigs spillover from Pteropus bats could be enhanced by analyzing the genomic signatures of circulating fruit bat paramyxoviruses, flagging those with a "swine-like" evolutionary profile for intensified surveillance.
Furthermore, the integration of receptor binding predictions with ecological data creates a powerful triage system for risk assessment. The MrSARS (Multi-reference Similarity Analysis of Receptor Sequences) pipeline, which compares host receptor sequences (e.g., ACE2) to those of known susceptible species, provides a rapid in silico screen for potential host range [25]. When combined with ML models predicting Lumpy Skin Disease Virus (LSDV) distribution based on vector ecology and climate [12, 14], or African Horse Sickness Virus risk based on Culicoides abundance [14], we move from a static risk map to a dynamic, evolution-aware prediction. The biological logic is elegant: the virus cannot spill over if its glycoprotein cannot bind the host receptor. By prioritizing surveillance resources toward animal species that are both genomically predicted to be susceptible and located in ecologically suitable regions for the vector, we can dramatically increase the efficiency of early detection systems.
Integrating Environmental and Ecological Surveillance Data with Genomic Signals
Genomic data does not exist in a vacuum; its predictive power is amplified when integrated with high-resolution spatiotemporal environmental and ecological data. The concept of "pathogen surveillance" must be expanded beyond the laboratory to include the landscape ecology of the host and vector. A landmark study on highly pathogenic avian influenza (HPAI) clade 2.3.4.4b used Bayesian additive regression trees (BART) to model risk across Europe, finding that while climate and physical geography explained most of the variation, the explicit inclusion of wild bird ecology (including species richness, abundance of specific taxa, and "abundance indices" for high-risk foraging behaviors) was a valuable refinement [24]. The biological mechanism is intuitive: the risk of transmission is not merely a function of viral presence, but of the density and behavior of competent hosts at a given time and place. This concept is elegantly captured by the Waterbird Activity Entropy (WAE) index, which uses monthly distributions of 779 waterbird species to create a dynamic measure of infection pressure [5]. With an AUC of 0.87 for predicting global avian influenza cases, this framework demonstrated that 52% of the globally exposed human population, 41% of cattle, and 51% of poultry reside within AIV exposure hotspots identified by WAE [5].
The integration of genomic signals with such ecological indices creates a predictive engine of remarkable potency. For instance, the prediction of food shortages for Australian flying foxes (Pteropus spp.), a key driver of Hendra virus spillover, has been achieved with over 92% accuracy using ML models trained on climatological (e.g., Oceanic Niño Index) and bat-level features (e.g., body weight, rescue center admissions) [6]. The spillover mechanism is a cascade: El Niño-driven drought reduces flowering and nectar availability (food shortage), which forces bats into fragmented remnant habitats and peri-urban areas, increasing contact with horses, the intermediate host for Hendra virus, much as pigs are the intermediate host for the related Nipah virus. By feeding these ecological alert signals into a model that concurrently analyzes viral genome evolution for markers of increased shedding or virulence, we can move from predicting when a spillover might occur (ecological trigger) to predicting which circulating viral strain is most likely to be the culprit (genomic trigger). This dual-pronged approach represents the apotheosis of One Health surveillance.
From Sequence to Surveillance: Data Pipelines and Computational Archetypes
The practical implementation of genomic integration requires robust, scalable data pipelines. The sheer volume of sequencing data, from metagenomics of environmental samples to targeted amplicon sequencing of clinical specimens, necessitates automated computational workflows. Feature engineering is the critical bridge between raw genomic data and ML models. For viral host prediction, features are often derived from genome composition (e.g., GC content, dinucleotide frequencies) and k-mer spectra [16, 21]. For variant impact prediction, features are derived from DMS scores, structural modeling, and evolutionary conservation [26]. The choice of ML algorithm is also crucial. A comparative analysis of models for predicting Bovine Leukemia Virus (BLV) seropositivity found that Random Forest (AUROC = 0.93) significantly outperformed logistic regression, identifying age ≥ 5 years and geographic location as the top predictors [10]. Similarly, for predicting clinical impact of PRRSV, extreme gradient boosting (XGBoost) outperformed logistic regression, although accuracy was insufficient for field implementation, suggesting a need for richer feature sets [9].
The concept of "reflective diagnostics" offers a forward-looking framework for integration [23]. This paradigm combines CRISPR-based molecular detection (providing a rapid, binary signal of pathogen presence), interpretive biosensing (contextualizing the signal with environmental metadata), and AI-driven learning loops (continuously updating decision rules). For a veterinary field application, this could translate to a point-of-care device that detects Porcine Epidemic Diarrhea Virus (PEDV) RNA in a fecal sample via CRISPR, simultaneously logs the barn temperature and humidity, and uses a local ML model to output not just a "positive/negative" result, but a dynamic risk score for within-herd spread. This moves surveillance from a passive reporting system to an active, adaptive, and site-specific management tool. The success of such systems, however, hinges on data standardization and model interpretability. The use of SHAP (Shapley Additive exPlanations) values, as demonstrated for predicting Avian Influenza Virus outbreaks on laying farms, is critical for building trust and understanding among veterinarians and producers [8]. By revealing that specific combinations of vaccination history, disinfection frequency, and farm management practices were the top features driving predictions, SHAP analysis provides actionable, auditable insights.
The Role of Veterinary Clinical Pathology in the Genomic Surveillance Feedback Loop
As a clinical pathologist, I must emphasize that the quality of genomic data is inextricably linked to the quality of the sample and the diagnostic process. The integration of laboratory data-including hematology, clinical chemistry, and serology-into ML models is a force multiplier. A study combining laboratory data (e.g., from blood samples) with risk factors in a neural network achieved 94% accuracy in predicting bovine neosporosis [27]. This principle can be extended to genomic surveillance. When a surveillance system flags a region as high-risk based on environmental and genomic signals, the next logical step is enhanced clinical pathology screening. For instance, in areas predicted for West Nile Virus in Birds or West Nile Virus in Horses, veterinarians could be alerted to perform serology (IgM capture ELISA) or RT-PCR on animals presenting with acute febrile or neurologic signs. The results of these clinical tests then feed back into the ML model, refining its predictions in a continuous reinforcement loop. This creates a dynamic, learning surveillance system that becomes more accurate over time.
Moreover, the digital pathology revolution is adding a new data stream. Deep learning models analyzing histological sections for characteristic lesions (e.g., syncytial cells in FIP, intranuclear inclusion bodies from Canine Herpesvirus 1) can provide a near-instantaneous diagnostic signal that corroborates or refutes the genomic prediction [3]. The integration of these multimodal data streams (genomic, environmental, and clinical-pathological) within a unified ML framework will ultimately define the standard of excellence for veterinary outbreak prediction. The path forward requires not only algorithmic innovation but also a fundamental cultural shift within the veterinary profession towards routine, systematic data collection and sharing, adhering to FAIR (Findable, Accessible, Interoperable, Reusable) data principles as advocated by WOAH. Only then can the immense promise of genomic data integration be fully realized for safeguarding animal and public health.
Clinical Application and Performance Metrics of Predictive Models
The translation of machine learning (ML) models from theoretical constructs and retrospective research into actionable clinical and field-deployable tools represents the most critical juncture in the fight against veterinary viral outbreaks. The performance of these models is not merely an academic exercise in maximizing the area under the receiver operating characteristic curve (AUC-ROC); it is a matter of economic viability, animal welfare, and, in the case of zoonotic pathogens, public health security. This section provides an exhaustive analysis of how predictive models are being operationalized across diverse veterinary contexts, from intensive livestock operations to aquaculture facilities and wildlife surveillance, with a rigorous focus on the metrics that define their clinical utility, the biological mechanisms they must capture, and the inherent limitations that temper their deployment.
Defining Clinical Utility: Beyond the AUC-ROC
While AUC-ROC remains a ubiquitous benchmark, its clinical relevance in veterinary outbreak prediction is often overstated, particularly in the context of highly imbalanced datasets where outbreaks are rare events. A model with an AUC of 0.95 may be clinically useless if its positive predictive value (PPV) is low, leading to an unacceptable number of false alarms that erode farmer trust and waste resources. The true clinical utility of a predictive model is defined by a suite of metrics that must be interpreted within the specific epidemiological and economic context of the target disease.
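The dependence of PPV on prevalence follows directly from Bayes' theorem and is easy to demonstrate. The sketch below uses illustrative sensitivity, specificity, and prevalence values chosen for the example; they are assumptions, not figures from any cited study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(outbreak | alert), from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    denom = true_pos + false_pos
    return true_pos / denom if denom > 0 else 0.0


# Same hypothetical test characteristics, decreasing outbreak prevalence.
for prevalence in (0.20, 0.05, 0.01):
    ppv = positive_predictive_value(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

With 90% sensitivity and 90% specificity, PPV falls to roughly 8% when only 1% of farm-periods are true outbreaks, meaning about eleven of every twelve alerts would be false alarms despite strong discrimination.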
For instance, in the prediction of Avian Influenza Virus outbreaks in Chinese laying hen farms, Zhuang et al. (2025) demonstrated that while a LightGBM model achieved an accuracy of 84.62% and an F1-score of 85.00%, the precision of 80.95% and recall of 89.47% provided a more nuanced picture [8]. The high recall (sensitivity) is clinically paramount in this context: missing a true outbreak (a false negative) could lead to catastrophic viral dissemination and massive economic losses. Conversely, the lower precision (80.95%) implies that nearly one in five alerts would be a false positive, potentially triggering unnecessary culling, quarantine, or vaccination campaigns. The authors' use of SHAP (SHapley Additive exPlanations) to identify top features, such as vaccination status and disinfection frequency, is a critical step toward clinical interpretability, allowing veterinarians to understand why a farm is flagged as high-risk [8].
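To show how these four metrics relate, the sketch below recomputes them from a confusion matrix. The counts are an assumption: they are one small set of integers that happens to reproduce the quoted percentages, not figures reported by the cited study.

```python
# Assumed confusion-matrix counts (not reported in the source); chosen because
# they reproduce the quoted accuracy, precision, recall, and F1-score.
tp, fp, fn, tn = 17, 4, 2, 16

precision = tp / (tp + fp)                  # share of alerts that are true outbreaks
recall = tp / (tp + fn)                     # share of true outbreaks that are caught
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)
false_alert_share = fp / (tp + fp)          # complement of precision

print(f"precision={precision:.2%} recall={recall:.2%} "
      f"accuracy={accuracy:.2%} F1={f1:.2%} false alerts={false_alert_share:.2%}")
```

Under these assumed counts, 4 of 21 alerts are false, which is the "nearly one in five" false positive share discussed above.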
This principle is echoed in the work of Akiyama and Sasaki (2025) on Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) control in Japanese breeding herds. Using Random Forest analysis, they identified that the feature importance for PRRSV control (e.g., semen management, controlled barn environment) differed markedly from that for PRRSV eradication (e.g., stringent management of replacement gilts, PRRS-free semen sources) [4]. The clinical application here is not a binary outbreak prediction but a stratified decision-support tool. The performance metric of interest was not just the model's accuracy in classifying farms as "controlled" or "not controlled," but the clinical actionability of the feature ranking. A farm aiming for eradication must prioritize biosecurity measures that the model identifies as critical for that specific goal, a nuance that a single AUC value cannot capture.
Performance in High-Stakes, Low-Prevalence Scenarios
The prediction of rare, high-consequence events, such as the incursion of African Swine Fever Virus (ASFV) or Foot-and-Mouth Disease Virus (FMDV), presents a unique challenge for model performance metrics. In these scenarios, the cost of a false negative is extremely high, while the cost of a false positive, though significant, is often more tolerable. This necessitates a shift in focus toward sensitivity and the use of techniques like oversampling to address class imbalance.
Keshavamurthy et al. (2024) directly addressed this in the context of Rabies Lyssavirus surveillance in low-resource settings. They compared XGBoost with Logistic Regression, finding that XGBoost with Random Oversampling (XGB-ROS) was the preferred technique for predicting this rare event [11]. The clinical metric of interest was the model's ability to stratify risk. Their XGB-ROS model classified 85.2% of confirmed rabies cases as "high risk" while only 0.01% of non-cases fell into this category [11]. This risk stratification framework is a direct clinical application: it allows public health officials in Haiti to prioritize limited diagnostic resources (e.g., direct fluorescent antibody testing) on animals classified as high or moderate risk, leading to a 3.2-fold increase in epidemiologically useful data compared to standard surveillance [11]. The model's performance is thus measured not by a single accuracy score, but by its operational impact on surveillance yield.
Similarly, in the context of predicting FMD in the Middle East and North Africa (MENA), Alkhamis et al. (2025) employed an interpretable ML framework that achieved accuracies >85% [7]. The clinical value of their model, however, lies in its ability to identify the non-linear relationships and threshold effects of predictors like sheep density and wind patterns. The performance metric here is the model's capacity to generate fine-scale risk maps that guide targeted vaccination and surveillance in under-reported regions like the Southern Arabian Peninsula, a task where traditional epidemiological models often fail [7].
Genomic and Sequence-Based Predictive Models: A New Paradigm for Clinical Metrics
The integration of viral genomic data into predictive models has opened a new frontier in clinical application, moving beyond environmental and demographic risk factors to the intrinsic biology of the pathogen. Performance metrics for these models must assess their ability to infer host range, predict virulence, and identify emerging variants in near real-time.
The work of Babayan et al. (2018) is foundational, demonstrating that machine learning can predict reservoir hosts and arthropod vectors directly from RNA virus genome sequences [28]. The clinical and ecological utility of such a model is profound: upon sequencing a novel virus, the model can immediately flag it as having a high probability of being vector-borne or originating from a specific reservoir, guiding field investigations and risk assessments. The performance metric here is the model's classification accuracy across diverse virus families, which the authors validated against known ecological data.
More recently, Alberts et al. (2024) applied gradient boosting machine classifiers to hemagglutinin sequences of H3 influenza viruses, achieving a 98.0% correct classification rate for host species [13]. The clinical application is two-fold. First, it can rapidly identify the likely origin of a virus isolated from an environmental sample or a novel host, which is critical for early outbreak response. Second, the model's predicted probability scores can highlight sequences of interest (those that show considerable probability for multiple hosts), indicating potential host adaptation events [13]. This metric of "predicted probability for multiple hosts" is a novel performance indicator that directly correlates with zoonotic or cross-species transmission risk.
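One simple way to operationalize "considerable probability for multiple hosts" is to flag any sequence for which more than one host class exceeds a probability cutoff. The sketch below assumes this rule; the function name, the 0.2 cutoff, and the example probabilities are hypothetical and are not taken from the cited study.

```python
def flag_multi_host(probabilities: dict[str, float], min_probability: float = 0.2) -> list[str]:
    """Return the plausible hosts (highest probability first) if more than one
    host class reaches min_probability; otherwise return an empty list."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    plausible = [host for host, p in ranked if p >= min_probability]
    return plausible if len(plausible) > 1 else []


# Hypothetical classifier outputs for two HA sequences.
ambiguous = {"avian": 0.55, "swine": 0.40, "human": 0.05}
clear_cut = {"equine": 0.97, "canine": 0.02, "avian": 0.01}

print(flag_multi_host(ambiguous))   # flagged: two hosts exceed the cutoff
print(flag_multi_host(clear_cut))   # not flagged: a single dominant host
```

The cutoff trades off review workload against the chance of missing a genuine adaptation event, so in practice it would need to be tuned against sequences with known host histories.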
In the realm of SARS-CoV-2, the ability to predict variant fitness has direct clinical implications for vaccine strain selection and public health policy. Ding and Yuan (2025) combined deep mutational scanning (DMS) data with socio-demographic indices to predict lineage fitness, finding that XGBoost achieved an R² of 0.786 and an accuracy of 0.923 [26]. The critical clinical finding was that immune escape was a stronger driver of fitness in low-resource settings. The performance metric here is not just the model's predictive power, but its ability to reveal context-dependent biological mechanisms that inform equitable vaccine distribution strategies.
Real-Time Monitoring and Wearable Biosensors: Metrics for Continuous Prediction
The advent of wearable biosensors and continuous environmental monitoring has enabled a shift from static risk prediction to dynamic, real-time health assessment. The performance metrics for these models must account for temporal dependencies and the ability to detect subtle physiological deviations before clinical signs emerge.
Khan et al. (2025) proposed a framework integrating wearable biosensors (temperature, heart rate, respiratory rate) with environmental sensors (humidity, ammonia) and ML algorithms (SVM, Random Forest, XGBoost, LSTM) for early livestock disease diagnosis [17]. The clinical metric of interest is the lead time, defined as the interval between the model's anomaly detection and the onset of visible symptoms. A model that can predict sickness in farm animals "up to two to three days before they show symptoms," as noted by Ehsanullah et al. (2026), provides a critical window for intervention [3]. The performance of such systems is evaluated using precision, recall, and F1-score, but the most clinically relevant metrics are the predictive horizon and the false alarm rate per animal-day. A high false alarm rate would lead to "alert fatigue" among farm staff, rendering the system useless.
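Both operational metrics are simple to compute once alerts and clinical onsets are logged. The sketch below is a minimal illustration; the function names, the per-1,000 animal-day normalization, and all input values are assumptions rather than details from the cited frameworks.

```python
from datetime import date


def lead_time_days(alert_day: date, symptom_onset: date) -> int:
    """Days between the model's first alert and the first visible clinical signs."""
    return (symptom_onset - alert_day).days


def false_alarms_per_1000_animal_days(false_alarms: int, animals: int, days: int) -> float:
    """False alarm rate normalized by monitoring exposure (animals x days)."""
    return 1000.0 * false_alarms / (animals * days)


# Hypothetical monitoring log: one animal flagged two days before symptoms,
# and 12 false alarms across 200 animals monitored for 30 days.
print(lead_time_days(date(2025, 3, 1), date(2025, 3, 3)))   # 2
print(false_alarms_per_1000_animal_days(12, 200, 30))       # 2.0
```

Normalizing by animal-days makes false alarm burdens comparable between herds of different sizes and monitoring periods of different lengths.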
Jlassi et al. (2026) demonstrated a similar concept in humans, using wearable biosensors during a stair-stepping test to predict viral respiratory tract infections. Their gradient-boosting model, using heart rate variability (HRV) features, achieved 70% sensitivity and 77% specificity for predicting inflammation levels [32]. While these metrics are moderate, the clinical application is significant: it provides a non-invasive, scalable tool for self-monitoring. The performance metric is the model's ability to correlate a simple physiological challenge test with a complex immunological state, offering a practical alternative to frequent lab testing.
Diagnostic Integration: SERS and Deep Learning for Point-of-Care Metrics
The convergence of advanced biosensing platforms with deep learning is redefining performance metrics for point-of-care diagnostics. The goal is to achieve laboratory-grade sensitivity and specificity in a field-deployable format.
Yang et al. (2025) developed a label-free SERS platform coupled with a deep learning model (MultiplexCR) for the detection of respiratory virus co-infections. The model achieved 98.6% accuracy in classifying virus co-infections and a mean absolute error of 0.028 for concentration regression, completing the entire detection process in 15 minutes [30]. The clinical performance metrics here are exceptional: the ability to simultaneously identify multiple pathogens (e.g., Avian Influenza Virus and Newcastle Disease Virus in poultry) and quantify their viral loads from a single saliva sample. This multiplexing capability and speed are transformative for outbreak management, allowing for immediate triage and containment.
In a related study, Yang et al. (2024) used an ACE2-based SERS sensor with a CoVari deep learning algorithm to detect and quantify SARS-CoV-2 variants, achieving 99.9% classification accuracy and an R² > 0.98 for concentration quantification [31]. The limit of detection (LOD) of 10.47 PFU/mL for SARS-CoV-2 is a critical performance metric, demonstrating sensitivity comparable to RT-PCR. The clinical application is the rapid identification of variant-specific outbreaks, which is crucial for deploying variant-specific countermeasures.
Limitations and the Gap Between Bench and Barn
Despite these advances, a sobering assessment of performance metrics reveals a persistent gap between model development and clinical deployment. Melmer et al. (2025) provided a cautionary tale in their study on predicting the clinical impact of PRRSV type 2 in Canadian sow herds. Their extreme gradient boosting models achieved only 60.8% accuracy for predicting abortion impact and 74.4% for preweaning mortality, with mean sensitivities of 50% and 56.2%, respectively [9]. The authors concluded that while the models showed improvement over baseline, the increase was "not sufficient to warrant field implementation" [9]. This study underscores that a model can be statistically significant without being clinically useful. The performance metrics must meet a threshold of operational acceptability, which is often higher than what is reported in academic literature.
Furthermore, the reliance on retrospective data introduces inherent biases. Megahed et al. (2024, 2025) demonstrated that Random Forest models could predict Bovine Leukemia Virus (BLV) seropositivity with high accuracy (AUROC of 0.93) in dairy cattle [10] and beef cattle [29]. However, the models were heavily dependent on features like age and geographic location, which are proxies for unmeasured confounders such as management practices and vector exposure. The clinical application of such a model is limited if it cannot account for farm-specific biosecurity interventions. The performance metric of "feature importance" must be interpreted with caution, as it may reflect correlation rather than causation.
Finally, the challenge of data standardization and model generalizability remains a major barrier. As noted by Li et al. (2026) in the context of food microbiology, "current challenges include lack of data standardization, poor model interpretability, and scarcity of large-scale annotated datasets" [1]. This is equally true in veterinary virology. A model trained on outbreak data from intensive poultry operations in Southeast Asia may perform poorly when applied to backyard flocks in Sub-Saharan Africa. The ultimate performance metric for any predictive model is its external validity, that is, its ability to maintain predictive accuracy across diverse ecological, genetic, and management contexts. Until this is rigorously demonstrated, the clinical application of these powerful tools will remain promising but not yet fully realized.
Molecular Pathogenesis and Host-Virus Interaction Modeling
The architecture of veterinary viral pathogenesis resides at the confluence of viral molecular machinery, host cellular permissiveness, and the immunological microenvironment. Understanding these interactions at a systems level is foundational for predictive modeling, as the molecular determinants of infection, replication, and transmission serve as the primary variables that machine learning algorithms must learn to generalize across species, tissues, and ecological contexts. The transition from descriptive virology to quantitative, data-driven modeling requires that we deconstruct the host-virus interface into measurable features (receptor binding affinities, innate immune evasion strategies, codon usage biases, and mutation-driven fitness landscapes) that can be encoded as input vectors for computational architectures. This section develops the conceptual bridge between molecular pathogenesis mechanisms and the machine learning frameworks that now enable prospective modeling of viral emergence, host range, and outbreak dynamics.
Molecular Determinants of Host Susceptibility and Viral Tropism
The first critical barrier to viral infection is the interaction between viral surface glycoproteins and host cellular receptors. This lock-and-key mechanism dictates species specificity, tissue tropism, and, ultimately, the potential for cross-species transmission. For the sarbecoviruses, including SARS-CoV-2, the angiotensin-converting enzyme 2 (ACE2) receptor serves as the primary entry portal, and variations in ACE2 orthologs across vertebrate species determine susceptibility. The MrSARS pipeline (Multi-reference Similarity Analysis of Receptor Sequences) developed by Frank et al. [25] leverages sequence similarity analysis across 825 vertebrate ACE2 sequences to predict species susceptibility, demonstrating that primates and even-toed ungulates rank highest, with experimental validation using pseudotyped virus entry assays confirming computational predictions. This approach operationalizes comparative genomics into a predictive framework that can be extended to other receptor-virus systems. Similarly, for the filoviruses, the Niemann-Pick C1 (NPC1) endosomal protein is an essential entry receptor. Lasso et al. [34] conducted combinatorial binding studies across seven filovirus glycoproteins and 81 bat species' NPC1 orthologs, generating 1,484 GP-NPC1 interaction pairs, and integrated these data with machine learning to predict susceptible bat hosts and geographic risk zones for Ebola virus spillover. Critically, the authors demonstrated that GP-NPC1 binding avidity correlates poorly with host phylogeny, indicating that receptor sequence alone is insufficient; structural and biochemical features must be explicitly encoded into predictive models.
The application of these principles to veterinary pathogens is rapidly expanding. For Avian Influenza Virus, the hemagglutinin (HA) protein's receptor-binding specificity, namely its preference for α2,3-linked sialic acids (avian type) versus α2,6-linked sialic acids (mammalian type), is a well-characterized molecular determinant of host range. Machine learning classifiers trained on HA sequences have achieved remarkable accuracy in predicting host origin. Alberts et al. [13] applied gradient boosting machines to H3 influenza HA sequences, achieving 98.0% correct classification of host species (avian, swine, canine, equine, human), with the analysis of predicted probabilities revealing sequences of interest where host convergence or reassortment events may have occurred. Xu and Wojtczak [35] extended this to a broader range of influenza A subtypes using 5-grams-transformer neural networks on HA sequences, achieving 99.54% AUCPR at higher taxonomic levels. These approaches exploit the evolutionary imprint of host adaptation encoded in the HA gene, capturing not only receptor-binding preferences but also co-evolving changes in glycosylation patterns, pH of fusion activation, and neuraminidase functional compatibility. For Porcine Reproductive and Respiratory Syndrome Virus, the open reading frame 5 (ORF-5) encoding the major envelope glycoprotein GP5 has been the focus of predictive modeling for clinical impact. Melmer et al. [9] used extreme gradient boosting with genetic data represented through multiple correspondence analysis to predict abortion and preweaning mortality outcomes in Canadian sow herds, achieving 74.4% accuracy for high-impact preweaning mortality, though sensitivity remained modest at 56.2%. Chadha et al. [18] demonstrated that random forest and SVM models applied to ORF-5 nucleotide and amino acid sequences could classify clinical impact (high versus low) for abortion and preweaning mortality, but notably, no baseline classifier successfully established genotypic-phenotypic linkages for sow mortality, underscoring the complexity of polygenic host-virus interactions that transcend single-gene analyses.
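Sequence classifiers of this kind first need a numeric or token representation of each sequence. A common choice, and the idea behind "5-gram" inputs, is to split the sequence into overlapping k-mers. The sketch below shows that step only; the use of overlapping windows with stride 1 is an assumption about the preprocessing, and the input string is an artificial toy sequence, not a real HA or ORF-5 sequence.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 5) -> Counter:
    """Count overlapping k-mers (stride 1) in a nucleotide or amino acid sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Artificial toy sequence, used only to illustrate the tokenization step.
toy_sequence = "ACDEFACDEFG"
counts = kmer_counts(toy_sequence, k=5)

print(sum(counts.values()))   # number of 5-mers in the sequence
print(counts.most_common(1))  # most frequent 5-mer and its count
```

The resulting counts can be fed to tree-based models as a sparse feature vector, while the ordered k-mer list can serve as the token stream for a sequence model.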
Viral Genome Compositional Biases as Predictors of Host Ecology
Beyond receptor-mediated entry, the long-term co-evolution of viruses with their hosts leaves genome-wide compositional signatures that can be exploited for host prediction. Brierley and Fowler [16] demonstrated that random forest models trained on dinucleotide and codon usage biases from spike protein and whole genome sequences of coronaviruses could predict animal host category (human, bat, camelid, carnivore, etc.) with ~73% accuracy. Notably, when applied to SARS-CoV-2 human sequences, the models predicted bat hosts (suborder Yinpterochiroptera), supporting bats as the ultimate reservoir. Babayan, Orton, and Streicker [28] extended this principle across 500+ single-stranded RNA viruses, showing that machine learning algorithms trained on genome composition (including GC content, CpG dinucleotide frequencies, and codon adaptation indices) could predict not only reservoir hosts (mammal, bird, invertebrate) but also the existence and identity of arthropod vectors. This is paradigm-shifting for veterinary virology: a single genome sequence from a novel virus, obtained during an outbreak investigation, can be computationally screened to infer whether the virus is vector-borne (e.g., Flavivirus, Alphavirus, Orbivirus) or directly transmitted, guiding immediate surveillance and control measures. For Bluetongue Virus and African Horse Sickness Virus, both transmitted by Culicoides biting midges, such in silico vector prediction from genome composition could prioritize regions for entomological surveillance. Purwono et al. [21] used support vector machines with metagenomic features (genome size, GC%, number of CDS) from 7,326 viral genomes to predict host type, achieving 84% accuracy after grid search and cross-validation optimization. These approaches collectively demonstrate that the molecular pathogenesis of host adaptation is writ large across the viral genome, not confined to receptor-binding domains.
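Two of the compositional features named above, GC content and CpG bias, can be computed directly from a sequence string. The sketch below uses the standard observed/expected CpG ratio, f(CG) / (f(C) x f(G)); the function name and the toy input are hypothetical, and the cited studies may have used different or additional feature definitions.

```python
from collections import Counter


def composition_features(genome: str) -> dict[str, float]:
    """GC content and CpG observed/expected ratio for one genome sequence."""
    seq = genome.upper().replace("U", "T")   # treat RNA and cDNA input alike
    n = len(seq)
    bases = Counter(seq)
    dinucleotides = Counter(seq[i:i + 2] for i in range(n - 1))

    freq_c = bases["C"] / n
    freq_g = bases["G"] / n
    freq_cg = dinucleotides["CG"] / (n - 1)
    # Ratio below 1 indicates CpG depletion relative to base composition.
    cpg_obs_exp = freq_cg / (freq_c * freq_g) if freq_c and freq_g else 0.0

    return {"gc_content": freq_c + freq_g, "cpg_obs_exp": cpg_obs_exp}


# Toy 10-nt input for illustration; real genomes are thousands of bases long.
features = composition_features("ACGUACGUAA")
print(features)
```

On short toy inputs the ratio is noisy; it becomes informative only at genome or gene scale, where it can be combined with codon usage features in the same feature vector.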
Modeling Immune Evasion, Viral Fitness, and Multi-Wave Outbreak Dynamics
The molecular arms race between host antiviral defenses and viral countermeasures generates the evolutionary dynamics that drive outbreak waves, vaccine escape, and the emergence of variants of concern. For SARS-CoV-2, the integration of deep mutational scanning (DMS) data with machine learning has enabled quantitative dissection of variant fitness. Ding and Yuan [26] combined DMS-derived measurements of ACE2 binding, immune escape, and cell entry with socio-demographic indices across nine countries, finding that immune escape had the strongest association with lineage growth rates in least-developed countries (rate ratio 1.90) and that XGBoost models achieved R² of 0.786 when incorporating national context. This highlights the critical insight that viral fitness is not an intrinsic molecular property but emerges from the interaction between viral traits and host population immunity, a principle directly applicable to veterinary pathogens. For Avian Influenza Virus, the emergence of highly pathogenic H5N1 clade 2.3.4.4b has been associated with mutations in the hemagglutinin cleavage site (acquisition of multibasic amino acids) and the neuraminidase stalk region, which modulate tissue tropism and systemic spread. Hayes et al. [24] used Bayesian additive regression trees to model H5 clade 2.3.4.4b presence across Europe, incorporating wild bird ecology metrics (species richness, abundance indices of high-risk taxa) alongside climate and geography, and demonstrated a shift in persistent year-round risk towards cold, low-lying regions of northwest Europe. The molecular pathogenesis of HPAI, specifically the ability to replicate in peripheral tissues and shed at high titers, fundamentally alters the epidemiological landscape, and machine learning models that fail to incorporate these molecular determinants will miss critical outbreak drivers.
The concept of mutation-driven multi-wave patterns has been formalized by Hoffer et al. [22, 36] through the Mutation epidemiological Renormalization Group (MeRG) framework, which applies Levenshtein distance clustering on spike protein sequences to detect emerging persistent variants. Their algorithm successfully identified the emergence of the Alpha variant at the threshold of 1% frequency in time-binned sequence data, providing an early warning system that could be translated to veterinary viruses with rapid evolution. For Equine Influenza A Virus and Porcine Reproductive and Respiratory Syndrome Virus, both of which exhibit substantial antigenic drift, such variant-aware early warning systems could inform vaccine strain selection and biosecurity interventions. The MeRG framework's ability to define persistent variants as stable chains over time and highlight branching events of epidemiological interest is particularly relevant for Infectious Bursal Disease Virus Variants, where antigenic variants have emerged despite extensive vaccination, necessitating real-time molecular surveillance.
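The core operation here, grouping sequences by edit distance, can be sketched in a few lines. The Levenshtein function below is the standard dynamic-programming algorithm; the greedy clustering rule, the distance threshold, and the toy peptide strings are stand-ins chosen for illustration and should not be read as the MeRG procedure itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cluster_by_distance(sequences: list[str], max_distance: int) -> list[list[str]]:
    """Greedy clustering (illustrative stand-in): each sequence joins the first
    cluster whose founding sequence is within max_distance, else starts a new one."""
    clusters: list[list[str]] = []
    for seq in sequences:
        for cluster in clusters:
            if levenshtein(seq, cluster[0]) <= max_distance:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Toy fragments standing in for time-binned spike protein sequences.
toy = ["NNYKL", "NNYKI", "TTRQP", "TTRQA"]
for cluster in cluster_by_distance(toy, max_distance=1):
    print(len(cluster) / len(toy), cluster)   # cluster share of the time bin
```

Tracking each cluster's share across successive time bins is what would turn this grouping into an early warning signal, for example by alerting when a new cluster first crosses a chosen frequency threshold.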
Integrating Molecular Pathogenesis with Spatial and Ecological Modeling
The deepest predictive power emerges when molecular features are integrated with environmental and ecological variables within a single modeling framework. For Foot-and-Mouth Disease Virus, Alkhamis et al. [7] demonstrated that interpretable machine learning models incorporating host density (sheep and buffalo), environmental features (wind, temperature, cropland proximity), and anthropogenic factors could predict serotype-specific risk landscapes in the Middle East and North Africa with accuracies exceeding 85%. The molecular biology of FMDV serotypes, including differences in receptor usage (integrins versus heparan sulfate), acid stability, and tissue tropism, manifests as distinct ecological niches. Serotype A risk was primarily influenced by buffalo density and proximity to roads, while serotype O was driven by sheep density and wind patterns, reflecting differences in host range and transmission efficiency encoded at the molecular level. Gunasekera et al. [33] extended this to South Asia using an ensemble of random forest, support vector, and gradient boosting algorithms, identifying production systems, isothermality, and cattle density as top predictors, with accuracy >0.87. The ability of FMDV to establish persistent infection in the pharyngeal epithelium of ruminants (the carrier state) is a critical feature of its molecular pathogenesis that drives long-distance transmission via animal movement, a variable that must be explicitly encoded in predictive models.
For vector-borne diseases, the link between molecular pathogenesis and ecological modeling becomes bidirectional: viral genetic determinants of vector competence (e.g., midgut infection and escape barriers, salivary gland infection thresholds) interact with vector distribution and abundance to determine outbreak risk. de Klerk et al. [14] modeled Culicoides imicola and C. bolitinos distribution in South Africa using random forest, with cattle density and water vapor pressure as top predictors, explaining 75% and 62% of variance respectively. However, the authors noted that midge densities did not show conclusive correlation with bluetongue or African horse sickness outbreaks, suggesting that viral genetic factors, such as the ability of certain Bluetongue Virus serotypes to replicate at higher temperatures or overcome Culicoides species-specific barriers, must be incorporated. The emerging field of molecular epidemiology, where viral genome sequences are combined with spatial and ecological data in machine learning pipelines, represents the frontier. For Rabies Lyssavirus, Keshavamurthy et al. [11] used XGBoost to estimate rabies probability in biting animals from Integrated Bite Case Management data in low-resource settings, achieving a 3.2-fold increase in epidemiologically useful data compared to routine surveillance. The molecular pathogenesis of rabies (its neurotropism, long incubation period, and salivary gland excretion) dictates the temporal and spatial scales of transmission, and models that capture these features can stratify risk with high sensitivity (85.2% of confirmed cases classified as high risk) and specificity (0.01% of non-cases misclassified).
The One Health framework demands that molecular pathogenesis be modeled not in isolation but as an interconnected system spanning domestic animals, wildlife, vectors, and humans. For Nipah Virus in Pigs, henipavirus entry requires the viral G glycoprotein to bind ephrin-B2/B3 receptors, which are highly conserved across mammals; this conservation explains the broad host range. Lagergren et al. [6] developed machine learning models that predicted food shortages for Australian flying foxes (Pteropus spp.), the reservoir host of Hendra virus, with 93.33% accuracy using climatological features and the Oceanic El Niño Index, with predictive signals up to nine months in advance. These ecological predictors of wildlife nutritional stress, which drives bat dispersal and urban encroachment, are mechanistically linked to viral shedding in urine and subsequent spillover to horses and humans. The ability of Hendra virus to persist in bat kidneys and be excreted in urine is the biological link between climate-driven food shortages and outbreak risk, a feature that must be embedded in any predictive model of henipavirus emergence.
Protocol and Methodology for Data Curation, Feature Engineering, and Model Training
The development of robust machine learning (ML) models for predicting veterinary viral outbreaks is fundamentally contingent upon the quality, breadth, and integrity of the underlying data. The protocol for data curation, feature engineering, and model training must be meticulously designed to address the unique complexities of veterinary virology, including multi-host transmission dynamics, environmental drivers, genomic variability, and the often sparse and heterogeneous nature of surveillance data. This section delineates a comprehensive, multi-stage methodology that integrates principles from epidemiological surveillance, molecular biology, and computational science to construct predictive frameworks capable of operational deployment.
Data Acquisition and Source Stratification
The foundational step involves the systematic acquisition of data from a diverse array of sources, categorized into distinct strata to capture the full spectrum of factors influencing viral emergence and spread. The primary strata include:
Epidemiological and Surveillance Data: This stratum forms the core of any outbreak prediction model. Data must be sourced from national and international veterinary health agencies, including the World Organisation for Animal Health (WOAH, formerly OIE) and the Food and Agriculture Organization (FAO). Datasets should include confirmed outbreak locations, dates, species affected, number of cases, and control measures implemented. For example, studies on African Swine Fever Virus and Highly Pathogenic Avian Influenza Virus along the Korean Demilitarized Zone utilized outbreak records from national databases to train MaxEnt models, achieving high predictive power (AUC > 0.92) [42]. Similarly, the prediction of Porcine Reproductive and Respiratory Syndrome Virus clinical impact relied on retrospective longitudinal data from sow herds, linking ORF-5 sequences to production parameters [9, 18].
Environmental and Climatic Data: Viral transmission, particularly for vector-borne and environmentally persistent pathogens, is heavily modulated by climatic variables. Essential features include temperature, precipitation, humidity, solar radiation, and the Normalized Difference Vegetation Index (NDVI). These are typically sourced from satellite remote sensing platforms (e.g., Copernicus, MODIS) and global climate databases. For instance, the distribution of Culicoides midges, vectors for Bluetongue Virus and African Horse Sickness Virus, was modeled using random forest with climate variables (temperature, water vapor pressure) and NDVI as key predictors [14]. The waterbird activity entropy (WAE) index, derived from monthly distributions of 779 waterbird species, was shown to be a powerful predictor of global Avian Influenza Virus risk, integrating environmental and ecological data [5].
Genomic and Pathogen Data: The genetic makeup of the pathogen dictates its transmissibility, virulence, and host range. This stratum includes whole-genome sequences, specific gene sequences (e.g., hemagglutinin for influenza, ORF-5 for PRRSV, spike protein for coronaviruses), and derived features such as codon usage bias, dinucleotide frequencies, and mutational profiles. A landmark study demonstrated that machine learning could predict animal reservoirs and arthropod vectors directly from the genome sequences of single-stranded RNA viruses, including Rabies Lyssavirus and West Nile Virus, with high accuracy [28]. For SARS-CoV-2, deep mutational scanning (DMS) data on viral traits (ACE2 binding, immune escape) were integrated with socio-demographic indices to predict lineage fitness across countries [26]. The use of spike protein sequences and Levenshtein distance enabled the unsupervised clustering of emerging variants, providing early warning signals for new waves of COVID-19 [22, 36].
Host and Demographic Data: Susceptible host distribution, density, and movement patterns are critical. This includes livestock census data (cattle, swine, poultry densities), wildlife population estimates (e.g., wild boar, waterfowl), and farm-level management practices. For Foot-and-Mouth Disease Virus in the MENA region, sheep density emerged as the dominant predictor, while buffalo density was key for serotype A [7]. In predicting Bovine Leukemia Virus seropositivity, age and geographic location were identified as the most important features [10, 29]. Anthropogenic factors, such as human population density and road proximity, were found to be more predictive of SARS-CoV-2 infection risk in animals than biophysical factors [41].
Socioeconomic and Biosecurity Data: Farm-level biosecurity practices, vaccination coverage, and market chain dynamics are often overlooked but are crucial for predicting outbreak occurrence and severity. A study on PRRSV control in Japanese breeding farms used a biosecurity assessment tool and random forest analysis to identify that semen management and controlled barn environments were critical for control, while stringent management of replacement gilts was essential for eradication [4]. For Avian Influenza Virus in Chinese laying farms, data on vaccination and disinfection practices were incorporated into a LightGBM model, significantly improving predictive performance (accuracy 84.62%) [8].
Data Preprocessing and Quality Control
Raw data from disparate sources are inherently noisy, incomplete, and heterogeneous. A rigorous preprocessing pipeline is mandatory.
- Handling Missing Data: Missing values are common in surveillance data. Strategies include imputation using the mean or median for continuous variables, the mode for categorical variables, or more sophisticated methods such as k-Nearest Neighbors (k-NN) imputation. For time series data (e.g., dengue cases), rolling-mean features can be used to fill gaps and capture temporal trends [44].
- Outlier Detection and Treatment: Outliers can arise from reporting errors or true epidemiological anomalies. Methods such as Z-score, IQR, or DBSCAN clustering are used to identify outliers. For critical features like mortality rates, outliers may be winsorized or capped to prevent model distortion.
- Data Normalization and Standardization: Features with different scales (e.g., temperature in °C vs. livestock density per km²) can bias gradient-based and distance-based algorithms. Min-Max scaling (normalization to the range [0, 1]) or Z-score standardization (mean=0, std=1) is applied. Tree-based models (Random Forest, XGBoost) are largely insensitive to feature scaling, but standardization is still recommended for SVM, k-NN, and neural networks.
- Addressing Class Imbalance: Outbreak events are rare compared to non-outbreak periods, leading to severe class imbalance. This is a critical challenge, as models can become biased towards the majority class (non-outbreak). Techniques employed include:
  - Random Oversampling (ROS): Duplicating samples from the minority class.
  - Synthetic Minority Oversampling Technique (SMOTE): Generating synthetic samples by interpolating between existing minority class instances.
  - Random Undersampling: Removing samples from the majority class.
  - Algorithmic Approaches: Using cost-sensitive learning or class weights in algorithms like XGBoost and Random Forest.
  A study on rabies epidemiology in low-surveillance settings found that XGBoost combined with ROS significantly enhanced model sensitivity for predicting rare rabies cases [11].
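As an illustration of the class-imbalance techniques listed above, the sketch below shows random oversampling and inverse-frequency class weights using only the standard library. The function names are hypothetical, and the weighting formula (n_samples / (n_classes × class_count)) is one common convention, assumed here rather than taken from the cited rabies study [11].

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the majority class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

Oversampling should be applied to the training portion only, after the data have been split; resampling before the split would place copies of the same outbreak record in both training and test sets.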
Feature Engineering and Selection
Feature engineering is the process of transforming raw data into informative features that improve model performance. This is arguably the most critical step in the pipeline.
- Temporal Features: Lag features (e.g., number of cases in the previous 1, 2, or 12 months) are essential for capturing disease seasonality and autocorrelation. For dengue prediction, lagged dengue cases and climatic variables (rainfall, temperature) with lags of 1-3 months are powerful predictors [19, 47]. Rolling window statistics (e.g., 4-week moving average of temperature) can smooth noise and reveal trends.
- Spatial Features: Geographic coordinates are used to derive features like distance to nearest farm, road, water body, or previous outbreak. Spatial autocorrelation metrics (e.g., Moran's I) can be calculated to quantify clustering. For Lumpy Skin Disease Virus, proximity to roads and cropland were significant predictors [7].
- Genomic Features:
  - Sequence Composition: GC content, dinucleotide frequencies (e.g., CpG, UpA), and codon usage bias (e.g., effective number of codons, ENC) are powerful predictors of host range and adaptation [16, 28].
  - k-mer Frequencies: The frequency of all possible nucleotide or amino acid substrings of length k (e.g., 3-mers, 5-mers) can be used as input features for models like SVM or neural networks. A 5-grams-transformer neural network achieved 98.01% F1-score for influenza virus host prediction [35].
  - Position-Specific Scoring Matrices (PSSMs): Derived from multiple sequence alignments, PSSMs capture the evolutionary conservation of each position in a protein, providing rich information for predicting functional properties like receptor binding [35].
  - Deep Mutational Scanning (DMS) Scores: These experimental data quantify the functional impact of every possible single amino acid mutation on traits like ACE2 binding or immune escape, providing a direct measure of viral fitness [26].
- Ecological Features: For pathogens maintained in wildlife reservoirs, features like the "waterbird activity entropy" (WAE) index, which integrates species distribution and residency time, have proven highly predictive [5]. Species richness and abundance indices for key reservoir hosts (e.g., specific waterfowl taxa for AIV) are also valuable [24].
- Interaction Features: The product or ratio of two features can capture synergistic effects. For example, the interaction between temperature and humidity is critical for mosquito survival and viral replication.
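The temporal features described above (lags of 1, 2, and 12 periods and a rolling mean) can be sketched as follows. This is a minimal illustration with hypothetical names; the key design point is that every feature for time step t is built only from values strictly before t, so the target never leaks into its own predictors.

```python
def add_lag_and_rolling(series, lags=(1, 2, 12), window=4):
    """Build one feature row per time step using only strictly earlier values."""
    rows = []
    start = max(max(lags), window)  # first step with a full history available
    for t in range(start, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        past = series[t - window:t]  # the slice excludes the current value
        row[f"roll_mean_{window}"] = sum(past) / window
        row["target"] = series[t]
        rows.append(row)
    return rows
```

For a monthly case series, `lags=(1, 2, 12)` captures the previous two months and the same month one year earlier; the first 12 observations are consumed as history and produce no rows.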
Feature Selection is performed to reduce dimensionality, prevent overfitting, and improve model interpretability. Common methods include:
- Filter Methods: Correlation-based feature selection (CFS) [44], chi-squared test, and mutual information. These are computationally efficient and independent of the ML algorithm.
- Wrapper Methods: Recursive Feature Elimination (RFE) and forward/backward selection. These evaluate feature subsets based on model performance but are computationally expensive.
- Embedded Methods: Regularization techniques (L1/Lasso, L2/Ridge) and tree-based feature importance (e.g., from Random Forest or XGBoost) perform feature selection during model training. The SHAP (SHapley Additive exPlanations) method is increasingly used to provide a unified measure of feature importance and model interpretability [8, 20, 46].
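To make the filter-method idea concrete, the sketch below ranks features by absolute Pearson correlation with the target and skips any feature that is nearly collinear with one already selected. It is a deliberately simplified relevance-and-redundancy filter, not the full CFS merit criterion cited above, and the names and the 0.9 redundancy threshold are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_select(features, target, k=2, redundancy=0.9):
    """Keep up to k features, most target-correlated first, skipping near-duplicates."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[other])) < redundancy
               for other in kept):
            kept.append(name)
        if len(kept) == k:
            break
    return kept
```

Here `features` is a mapping from feature name to a list of values. Because the filter never consults a model, it is cheap to run, but it only detects linear associations.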
Model Training, Validation, and Hyperparameter Optimization
The selection of the ML algorithm is guided by the nature of the prediction task (classification, regression, or clustering), data size, and interpretability requirements.
Algorithm Selection: A suite of algorithms is typically benchmarked. Commonly used models in veterinary outbreak prediction include:
- Tree-Based Ensembles: Random Forest (RF), Gradient Boosting Machines (GBM), XGBoost, LightGBM, and CatBoost. These are highly robust, handle non-linear relationships and missing data well, and provide feature importance. LightGBM achieved the best performance for predicting AIV outbreaks (F1-score 85%) [8] and for detecting pathogenic microorganisms in seafood (99.9% PPV) [2]. XGBoost has been used extensively for predicting SARS-CoV-2 shedding duration [46] and FMD risk [7].
- Support Vector Machines (SVM): Effective for high-dimensional data (e.g., genomic sequences) and for finding optimal separating hyperplanes. SVM with grid search optimization achieved 84% accuracy for virus host prediction [21].
- Deep Learning (DL): Convolutional Neural Networks (CNNs) are used for image-based diagnostics (e.g., LSDV detection from skin lesions [12], COVID-19 from chest X-rays [45]). Long Short-Term Memory (LSTM) networks are powerful for time series forecasting, such as predicting dengue cases [19] or modeling GCRV transmission dynamics [38]. Transformers and attention mechanisms (e.g., CBAM-DenseNet) are emerging for complex sequence and image analysis [12].
- Hybrid and Ensemble Models: Combining multiple models often yields superior performance. A consensus voting ensemble of LR, RF, KNN, and SVM stabilized predictions for PRRSV clinical impact [18]. Hybrid models combining deep feature extraction with ML classifiers (e.g., CNN + XGBoost) have shown excellent results for COVID-19 detection [45].
Training and Validation Strategy:
- Data Splitting: The dataset is split into training (e.g., 70-80%), validation (e.g., 10-15%), and independent test (e.g., 10-15%) sets. The test set must be held out completely until final model evaluation to avoid data leakage.
- Cross-Validation (CV): k-fold CV (typically 5-fold or 10-fold) is used on the training set to tune hyperparameters and assess model stability. Stratified CV is essential for imbalanced datasets to ensure each fold has a representative class distribution [8].
- Temporal Validation: For time-series data, standard CV can leak future information into the past. Walk-forward validation or time-series split is used, where the model is trained on past data and validated on subsequent, unseen time periods [2, 43].
- Hyperparameter Tuning: Grid search, random search, or Bayesian optimization (e.g., using Optuna) are employed to find the optimal set of hyperparameters (e.g., learning rate, tree depth, number of estimators, regularization parameters) that maximize a chosen metric (e.g., AUC-ROC, F1-score) on the validation set.
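The walk-forward scheme described above can be sketched as an index generator. This is a minimal expanding-window version with hypothetical names; it assumes the samples are already sorted in time order and that equal-sized test blocks are acceptable.

```python
def walk_forward_splits(n_samples, n_splits=3):
    """Yield (train_idx, test_idx) pairs where each test block follows its training data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_train_end = n_samples - n_splits * test_size
    for i in range(n_splits):
        train_end = first_train_end + i * test_size
        # The training window expands; the test block is the next unseen period.
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
```

With 10 time-ordered samples and 3 splits, the model is trained on samples 0-3 and tested on 4-5, then trained on 0-5 and tested on 6-7, and finally trained on 0-7 and tested on 8-9.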
Evaluation Metrics: A comprehensive set of metrics is used to evaluate model performance, particularly given the class imbalance.
- For Classification: Accuracy, Precision, Recall (Sensitivity), Specificity, F1-score, Matthews Correlation Coefficient (MCC), and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The AUC-PR (Precision-Recall) is more informative than AUC-ROC for highly imbalanced datasets [15].
- For Regression: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²).
- For Risk Stratification: The ability to correctly classify high-risk vs. low-risk areas is often assessed using sensitivity and specificity at a chosen probability threshold. Dynamic threshold optimization, as used in the seafood pathogen study, can further refine performance [2].
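The classification metrics listed above all derive from the four cells of the confusion matrix. The sketch below computes them for binary labels, with hypothetical names and the convention (an assumption) that undefined ratios are reported as 0.0.

```python
import math


def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = outbreak, 0 = no outbreak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # report undefined ratios as 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

A model that predicts "no outbreak" for every one of 100 records containing 2 true outbreaks scores 98% accuracy but has recall, F1, and MCC of zero, which is why accuracy alone is a poor guide on imbalanced outbreak data.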
Addressing Data Leakage and Ensuring Reproducibility
Data leakage, where information from the future or the test set inadvertently influences the training process, is a critical pitfall. Strict protocols must be enforced:
- Temporal Leakage: All feature engineering (e.g., calculating rolling averages) must use only data available before the prediction time. Any fitted transformation (scaling parameters, imputation values, selected features) must be estimated on the training set alone and then applied unchanged to the validation and test sets.
- Spatial Leakage: When predicting risk across geographic regions, observations from the same location, or from neighboring locations that are spatially autocorrelated, should not be split between training and test sets. Spatial cross-validation, where folds are created based on geographic clusters, can mitigate this.
- Genomic Leakage: When using sequence data, highly similar sequences (e.g., from the same outbreak) must be clustered (e.g., using CD-HIT) to ensure they are not split across training and test sets, which would inflate performance metrics [39, 40].
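Both spatial and genomic leakage can be avoided with the same mechanism: assign whole groups, not individual records, to folds. The sketch below assumes each record already carries a group identifier (a geographic cluster, or a sequence cluster produced by an external tool such as CD-HIT); the function name and the round-robin assignment are illustrative choices.

```python
def group_k_fold(groups, n_folds=3):
    """Yield (train_idx, test_idx) so that no group appears on both sides of a split."""
    unique = sorted(set(groups))
    # Round-robin assignment of whole groups to folds.
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test
```

Round-robin assignment does not balance fold sizes when groups differ greatly in size; a production pipeline would likely need a size-aware assignment.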
Finally, to ensure reproducibility, the entire pipeline (from data preprocessing scripts and feature engineering code to model architectures and hyperparameter configurations) must be version-controlled and documented. Public repositories (e.g., GitHub) and containerization (e.g., Docker) are recommended practices, as demonstrated by the public release of the AIV warning code [8]. The integration of these rigorous protocols ensures that the resulting predictive models are not only accurate but also robust, generalizable, and trustworthy for informing veterinary public health interventions.
Multimodal Data Fusion and Deep Learning Architectures for Early Warning Systems
The paradigm of outbreak prediction is undergoing a fundamental transformation, moving from univariate time-series analysis or single-modality risk factor modeling toward a more holistic and biologically informed framework: multimodal data fusion. In the context of veterinary viral outbreaks, the etiological agents do not operate in isolation. The emergence, transmission, and pathogenicity of viruses such as Avian Influenza Virus, African Swine Fever Virus, and Porcine Reproductive and Respiratory Syndrome Virus are governed by a complex interplay of viral genetics, host susceptibility, environmental conditions, management practices, and anthropogenic factors. Early warning systems that aspire to clinical utility must therefore integrate these disparate data streams into a coherent, predictive scaffold. Deep learning architectures, with their capacity to learn hierarchical representations from high-dimensional, heterogeneous data, are uniquely suited to this task. This section provides an exhaustive examination of the state-of-the-art in multimodal data fusion and the specific deep learning architectures that are enabling a new generation of early warning systems in veterinary virology.
The Imperative for Multimodal Integration
Traditional risk assessment models, often reliant on logistic regression or basic decision trees, are fundamentally limited by their inability to capture non-linear interactions across data modalities. For instance, predicting a Foot-and-Mouth Disease Virus incursion requires not only knowledge of livestock density but also seasonal climate patterns, viral serotype circulation, and proximity to major transport routes. The interpretable machine learning framework applied to FMD in the Middle East and North Africa (MENA) region explicitly demonstrated this complexity, revealing that sheep density, wind, temperature, and human population density acted as dominant, non-linear predictors for serotype O, while buffalo density and proximity to roads were more influential for serotype A [7]. This finding underscores that a single-modality view (looking only at host density, for instance) provides an incomplete and potentially misleading risk picture. Similarly, for Highly Pathogenic Avian Influenza (HPAI), ecological niche models that initially relied primarily on physical geography were significantly refined by the incorporation of wild bird ecology, including species richness, abundance of specific taxa, and behavioral traits affecting exposure risk [24]. This refinement, achieved through Bayesian additive regression trees (BART), moved risk projection from a static environmental map to a dynamic, ecologically informed landscape. The integration of waterbird activity entropy (WAE), a metric that captures the residency time and activity intensity of waterbird populations, further enhanced this capability, achieving an AUC of 0.87 for global HPAI risk prediction and identifying hotspots that cover 14% of global land area [5]. These examples illustrate a core principle: the signal for an impending outbreak is distributed across multiple, often weakly correlated, data sources, and a fusion architecture is required to extract it.
Architectures for Data Fusion: From Shallow to Deep Integration
The architectures employed for multimodal fusion in veterinary early warning systems range from early (data-level) fusion to intermediate (feature-level) and late (decision-level) fusion, with intermediate fusion increasingly favored for its ability to learn cross-modal representations.
1. Late Fusion (Ensemble and Voting Mechanisms): The simplest form of fusion involves training independent models on distinct data modalities and then combining their predictions. This approach is well-illustrated by the PRRSV clinical impact classification studies. In one analysis, baseline classifiers (logistic regression, random forest, k-nearest neighbor, support vector machine) were trained separately on ORF-5 nucleotide sequences, amino acid sequences, and demographic data [18]. A consensus voting ensemble then aggregated the predictions from these different input representations. While this approach stabilized performance metrics and made predictions more robust, it did not substantially improve overall diagnostic accuracy over the best baseline classifier, suggesting that late fusion may fail to capture the deep, non-linear interactions between genotype and phenotype that occur at the biological interface [18]. For PRRSV, the genetic input (ORF-5) carries information about viral virulence, while demographic data (herd size, parity distribution) reflects host vulnerability; the true risk signal lies in their interaction, which a late fusion model can only approximate. Despite these limitations, late fusion remains a practical and interpretable strategy for integrating diverse data sources, particularly when computational resources are constrained or when models are developed by different stakeholder groups.
2. Intermediate Fusion (Feature-Level Integration): More powerful architectures learn a joint representation from multiple data sources before making a prediction. This is the domain of true multimodal deep learning. For example, the LightGBM model that achieved superior performance in predicting Avian Influenza Virus outbreaks on Chinese laying farms was trained on a dataset comprising four categories and seventeen subcategories of data, including management factors (vaccination schedules, disinfection practices) that are often omitted from such models [8]. LightGBM, a gradient boosting framework that utilizes histogram-based algorithms, inherently performs a form of feature-level integration by learning optimal splits across all input dimensions. The Shapley additive explanation (SHAP) analysis revealed that management factors, particularly vaccination timing and biosecurity compliance, were among the top five most influential features, demonstrating that the model had learned a fused representation of virological, environmental, and operational risk. For African Swine Fever Virus risk mapping along the Korean Demilitarized Zone, the MaxEnt machine-learning algorithm was used to fuse climatic variables (precipitation, solar radiation), land use, and wild boar habitat data [42]. The model achieved an AUC of 0.92, but critically, it revealed that precipitation from spring to early summer and solar radiation in winter were the dominant predictors, while land use contributed little. This finding, only accessible through a fused modeling approach, has direct implications for surveillance resource allocation: understanding wild boar habitat preferences alone is insufficient for preventing ASF epidemics.
3. Cross-Modal Attention and Transformer-Based Architectures: The most recent and biologically sophisticated architectures leverage attention mechanisms, originally developed for natural language processing, to dynamically weigh the importance of different data modalities at each prediction step. Deep mutational scanning (DMS) data, which provides a comprehensive map of how mutations affect viral traits such as ACE2 binding affinity and immune escape, has been fused with socio-demographic indices (SDI) using XGBoost to predict SARS-CoV-2 lineage fitness [26]. The XGBoost model, which is a tree ensemble rather than an attention network but likewise lets the weight of one input depend on the value of another, revealed that immune escape was the strongest predictor of viral fitness in the least developed countries (rate ratio = 1.90), but this effect declined as economic development increased. This cross-modal interaction (viral genetics × human demography) would be invisible to a model that did not fuse these data streams. For veterinary applications, a parallel architecture could fuse deep mutational scanning data from pathogens like Feline Influenza A Virus or Canine Parvovirus with host population immunity data and environmental transmission risk factors.
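Of the three fusion levels above, late fusion is the easiest to show in a few lines. The sketch below combines per-sample outputs from modality-specific models by hard majority vote and by weighted probability averaging. The names are hypothetical and the inputs are assumed to be precomputed model outputs; this is an illustration of the voting step, not a reconstruction of the consensus ensemble in [18].

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority label per sample; use an odd number of models to avoid binary ties."""
    fused = []
    for votes in zip(*predictions_per_model):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


def soft_vote(probabilities_per_model, weights=None, threshold=0.5):
    """Weighted mean of outbreak probabilities per sample, then a threshold."""
    n_models = len(probabilities_per_model)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    fused = [sum(w * p for w, p in zip(weights, sample)) / total
             for sample in zip(*probabilities_per_model)]
    return fused, [int(p >= threshold) for p in fused]
```

Each inner list holds one model's outputs across all samples, for example one list from a sequence-based classifier and one from a demographic classifier. Because the models never see each other's inputs, any genotype-by-demography interaction has to be recovered from their outputs alone, which is the limitation of late fusion noted above.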
Biological Insights from Fused Data Streams: A Clinical Pathologist's Perspective
From a clinical pathology standpoint, the value of multimodal fusion lies not merely in improved accuracy but in the ability to generate biologically interpretable insights that inform intervention strategies. The integration of hematological, biochemical, and genomic data within a unified framework can reveal pathophysiological pathways that are not apparent from any single data type. Consider the prediction of prolonged SARS-CoV-2 shedding, which was achieved using an XGBoost model trained on ten features including vaccination status, hypertension, admission Ct value, and age [46]. In a veterinary context, a similar architecture could fuse viral load data (Ct values from RT-PCR for Canine Distemper Virus), host inflammatory biomarkers (serum amyloid A, haptoglobin), and host genetic markers (MHC haplotype) to predict which individuals will become chronic shedders or develop severe neurologic disease. The model's reliance on admission Ct value and comorbidities mirrors the clinical reality that viral clearance is a function of both viral inoculum and host immunocompetence, a relationship that is inherently multimodal.
The fusion of label-free spectroscopic data with deep learning represents a particularly exciting frontier for point-of-care diagnostics. The SERS-MultiplexCR platform, which integrates surface-enhanced Raman scattering (SERS) spectra from over 1.2 million data points with a deep learning model, achieved 98.6% accuracy in classifying eleven respiratory viruses in co-infection scenarios and provided concentration estimates with a mean absolute error of 0.028 [30]. For veterinary applications, this platform could be adapted to simultaneously detect and quantify Feline Calicivirus, Bovine Respiratory Syncytial Virus, and Equine Influenza A Virus from a single nasal swab, providing both species-level identification and viral load data that could guide treatment and biosecurity decisions. The ACE2-based SERS sensor enhanced by the CoVari deep learning algorithm further demonstrates the power of this approach for variant discrimination, achieving 99.9% accuracy in differentiating SARS-CoV-2 variants [31]. For veterinary virology, a similar sensor functionalized with host receptor orthologs (e.g., feline aminopeptidase N for Feline Coronavirus and FIP) could provide real-time surveillance for emerging variants that exhibit altered host tropism.
Temporal and Spatial Fusion: The Role of Recurrent, Convolutional, and Graph Neural Networks
Early warning systems are inherently temporal and spatial problems. Outbreaks do not occur randomly but propagate through time and across geographic landscapes. Deep learning architectures must therefore capture both temporal dependencies and spatial correlations. Recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have been extensively applied to time-series prediction of vector-borne diseases. In the context of Dengue Virus forecasting, an LSTM model trained on ten years of meteorological and epidemiological data from Gujarat, India, achieved an R² of 0.84, with previously reported dengue cases, population density, and monthly rainfall emerging as the most effective predictors [19]. The LSTM's capacity to learn from lags of one, two, and twelve months allowed it to capture both short-term epidemic dynamics and long-term seasonal trends. For veterinary applications, an LSTM-based architecture could fuse time-series data on vector abundance (e.g., Culicoides midge trap counts), meteorological variables, and livestock seroprevalence to forecast Bluetongue Virus incursions with a lead time sufficient for vaccination.
The fusion of temporal and spatial data is best achieved through hybrid architectures. The epidemic mathematical model SEIDR (Susceptible-Exposed-Infected-Dead-Removed), combined with an LSTM algorithm for parameter estimation, was used to simulate the transmission dynamics of Grass Carp Reovirus (GCRV) in rare minnow populations under varying temperature and density conditions [38]. The hybrid approach revealed that the basic reproduction number (R₀) was less than 1 at 21°C, indicating disease fade-out, but increased with temperature and density. This finding, which integrates mechanistic modeling with deep learning-based parameter inference, provides a quantitative framework for predicting outbreak risk under different environmental scenarios. For aquaculture, this architecture could be extended to predict outbreaks of Infectious Salmon Anemia Virus or White Spot Syndrome Virus by fusing water temperature, salinity, and stocking density data with viral genetic surveillance.
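To show how the basic reproduction number governs fade-out versus spread in a compartmental model of this kind, the sketch below integrates a simple SEIDR system with forward Euler steps. The equations, parameter names, and values are assumptions made for illustration (a closed population with transmission rate beta, incubation rate sigma, removal rate gamma, and disease-induced death rate mu); they are not the model structure or fitted parameters of the cited GCRV study [38], where an LSTM was used to estimate parameters.

```python
def basic_reproduction_number(beta, gamma, mu):
    """R0 for this assumed model: transmission rate over total exit rate from I."""
    return beta / (gamma + mu)


def simulate_seidr(beta, sigma, gamma, mu, s0, e0, days, dt=0.1):
    """Forward-Euler integration; returns a list of (S, E, I, D, R) tuples."""
    s, e, i, d, r = float(s0), float(e0), 0.0, 0.0, 0.0
    n = s + e  # initial population size, used to scale transmission
    history = [(s, e, i, d, r)]
    for _ in range(round(days / dt)):
        new_exposed = beta * s * i / n * dt
        new_infectious = sigma * e * dt
        new_dead = mu * i * dt
        new_removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_dead - new_removed
        d += new_dead
        r += new_removed
        history.append((s, e, i, d, r))
    return history
```

Running the simulation with a beta that gives R₀ below 1 produces only a handful of deaths before the infection fades out, while a beta that gives R₀ above 1 produces an epidemic; in a hybrid pipeline, a learned component would supply beta as a function of temperature and density.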
Graph neural networks (GNNs) represent a powerful emerging architecture for modeling the spatial spread of infectious diseases. By representing farms, markets, and wildlife habitats as nodes in a graph, with edges representing animal movements, trade flows, or vector flight paths, GNNs can learn to propagate risk signals through the network. For Foot-and-Mouth Disease Virus, where livestock movement is a primary driver of long-distance spread, a GNN could fuse animal movement data (from ear-tag or passport systems), vaccination coverage, and environmental risk factors to predict the most likely pathways of incursion. The ecological niche model that successfully predicted FMD risk in South Asia, incorporating livestock densities, land use, and climatic variables, achieved accuracies exceeding 87% using a multi-algorithm ensemble [33]. A GNN-based extension of this work could provide fine-grained, node-level predictions that enable targeted interventions such as pre-emptive vaccination or movement restrictions at specific high-risk farms.
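The core operation of a GNN is message passing: each node updates its state from the states of the nodes that send it edges. The sketch below performs that aggregation step on a directed movement graph with a fixed mixing weight. It is not a trained GNN, since a real model would learn the aggregation weights from data; the node names, the `self_weight` parameter, and the mean aggregator are all assumptions for illustration.

```python
def propagate_risk(node_risk, edges, rounds=1, self_weight=0.5):
    """Mix each node's risk with the mean risk of nodes that ship animals to it."""
    incoming = {node: [] for node in node_risk}
    for source, target in edges:
        incoming[target].append(source)
    risk = dict(node_risk)
    for _ in range(rounds):
        updated = {}
        for node, own in risk.items():
            sources = incoming[node]
            # Nodes with no incoming movements keep their own risk.
            neighbour_mean = (sum(risk[s] for s in sources) / len(sources)
                              if sources else own)
            updated[node] = self_weight * own + (1 - self_weight) * neighbour_mean
        risk = updated
    return risk
```

With edges from an infected farm to a market and from the market to a second farm, one round raises the market's risk and a second round carries part of that risk on to the second farm, mirroring how stacked GNN layers extend the reach of a signal by one hop per layer.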
Addressing Data Heterogeneity and Missingness: The Role of Self-Supervised and Contrastive Learning
A major practical challenge in multimodal fusion is the heterogeneity of data: some farms may have extensive genomic surveillance but limited environmental data, while others may have continuous temperature logging but no genetic typing. Deep learning architectures must be robust to missing modalities. Self-supervised learning, in which a model is pre-trained on a large corpus of unlabeled data and then fine-tuned on a smaller labeled dataset, offers a promising solution. For Porcine Reproductive and Respiratory Syndrome Virus, where open reading frame 5 (ORF-5) sequence data is widely available but often not linked to detailed clinical outcome data, a self-supervised model could be pre-trained to learn a representation of the ORF-5 sequence space, and then fine-tuned on the smaller subset of farms where clinical impact data are available. This approach could overcome the limitation identified in one PRRSV study, where even extreme boosting machine learning models achieved only 60.8% accuracy for predicting abortion impact and 74.4% for pre-weaning mortality [9]. The study noted that baseline classifiers showed good performance with moderate variance, but that overall accuracy was insufficient for field implementation. A self-supervised pre-training strategy could improve the signal-to-noise ratio by learning a more robust embedding of the genetic data before fusion with demographic features.
Contrastive learning, a specific form of self-supervised learning, trains models to bring representations of similar data points closer together while pushing representations of dissimilar data points apart. For early warning systems, a contrastive learning framework could learn a joint embedding space for multimodal data from outbreak and non-outbreak scenarios, such that new, unseen data can be rapidly classified based on its proximity to known outbreak embeddings in the latent space. This approach is particularly valuable for emerging pathogens, such as Nipah Virus in Pigs, where historical outbreak data are sparse. By fusing bat distribution data, palm sap harvesting practices, and pig population density, a contrastive learning model could identify pre-outbreak signatures even in the absence of a large labeled training set.
Sensor Fusion: From Wearables to Environmental Monitoring
The proliferation of low-cost sensors in veterinary practice is generating a wealth of continuous physiological and environmental data that must be fused within intelligent architectures. Wearable biosensors that capture heart rate, respiratory rate, body temperature, and movement patterns, when combined with environmental sensors that monitor temperature, humidity, and ammonia levels, provide a real-time window into animal health [17]. Machine learning algorithms, including SVM, Random Forest, XGBoost, and LSTM, are being deployed to detect anomalies in these multimodal streams and predict health risks before clinical symptoms appear [17]. For livestock operations, this approach could provide early warning for Bovine Respiratory Disease Complex, where multiple pathogens such as Bovine Coronavirus and Mannheimia haemolytica interact with environmental stressors to produce clinical disease. The fusion of heart rate variability (HRV) features from wearables with inflammatory biomarker data has already demonstrated 70% sensitivity and 77% specificity for predicting viral respiratory tract infections in humans [32], and similar architectures are directly translatable to veterinary species.
The integration of ultraviolet germicidal irradiation (UVGI) and mechanical ventilation data for Avian Influenza Virus control represents a unique challenge and opportunity for sensor fusion in intensive poultry operations [48]. While tunnel and hybrid ventilation systems are associated with improved air quality and lower modeled infection risks, poultry houses present fundamentally different challenges than human buildings: elevated bioaerosol loads, high population density, and proximity between animals [48]. A multimodal early warning system for HPAI in poultry barns would need to fuse real-time data on ventilation rates, UVGI dosing (with reported inactivation rate constants for influenza ranging from 0.003 to 2.18 m²/J), temperature and humidity, and perhaps even audio signals for cough detection, within a deep learning framework that can learn the complex, non-linear relationships between these environmental parameters and viral transmission risk.
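The practical meaning of the reported range of inactivation rate constants becomes clearer with a short calculation. The sketch below assumes the common first-order (log-linear) dose-response model, in which the surviving fraction is exp(-k × dose) with k in m²/J and dose in J/m²; that model and the function names are assumptions, not details taken from [48].

```python
import math


def surviving_fraction(k, dose):
    """Fraction of virus remaining infectious after a UV dose (first-order model)."""
    return math.exp(-k * dose)


def dose_for_log_reduction(k, log10_reduction=1.0):
    """UV dose (J/m^2) needed for the requested number of log10 reductions."""
    return log10_reduction * math.log(10) / k


if __name__ == "__main__":
    for k in (0.003, 2.18):  # the two ends of the range quoted in the text
        d90 = dose_for_log_reduction(k)
        print(f"k = {k} m^2/J -> dose for 90% inactivation = {d90:.2f} J/m^2")
```

Under this model the dose required for a given reduction is inversely proportional to k, so the two ends of the quoted range differ by a factor of roughly 700 in required dose. A fusion model would therefore need the applicable rate constant, or the conditions that determine it, as an explicit input.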
The Path Forward: Integrating Genomic, Clinical, and Environmental Modalities
The ultimate goal of multimodal data fusion in veterinary early warning systems is to create a unified representation of the host-pathogen-environment axis that can be interrogated in real time. The success of this endeavor depends not only on algorithmic innovation but on the systematic collection and standardization of data across the veterinary ecosystem. The WOAH (World Organisation for Animal Health) and FAO have called for enhanced surveillance systems that integrate laboratory, clinical, and epidemiological data. Deep learning architectures provide the computational engine to realize this vision, but they require high-quality, labeled, and interoperable data. The CRISP-Biosensor-AI platform concept, which integrates CRISPR-based molecular detection with environmental sensing and AI-driven learning loops, offers a prot
Challenges and Future Directions in Veterinary Viral Outbreak Prediction
The deployment of machine learning (ML) for predicting veterinary viral outbreaks has progressed from theoretical construct to operational reality, yet the translation of algorithmic promise into field-level impact remains fraught with profound biological, logistical, and epistemological challenges. As a veterinary clinical pathologist who has witnessed the diagnostic chaos of emerging outbreaks, from the peracute mortality of African Swine Fever Virus in naive swine herds to the insidious respiratory ravages of Porcine Reproductive and Respiratory Syndrome Virus, I recognize that the path forward demands not merely better algorithms, but a fundamental reconceptualization of how we integrate pathogen biology with data science. The following analysis dissects the critical obstacles and charts the transformative frontiers that will define the next generation of predictive veterinary virology.
Data Heterogeneity, Surveillance Bias, and the Curation Crisis
The most pervasive challenge confronting ML-based outbreak prediction is the profound heterogeneity and systematic bias inherent in veterinary surveillance data. Unlike human clinical datasets, which benefit from centralized reporting infrastructure, veterinary data are fragmented across diagnostic laboratories, government agencies, production company records, and academic research projects, each with distinct collection protocols, case definitions, and diagnostic sensitivities. This fragmentation creates a critical problem: models trained on high-quality, well-characterized outbreak data from intensively managed livestock operations in high-income countries perform abysmally when applied to the smallholder, backyard, or pastoralist production systems that characterize much of the global livestock population. The literature repeatedly emphasizes that data quality issues (including missing values, inconsistent recording, and lack of standardized reporting) represent the single greatest impediment to model generalizability [1, 3, 49, 55, 57]. For pathogens like Avian Influenza Virus, where surveillance bias is particularly acute, the most robust ML frameworks risk becoming parochial tools that reflect the ecologies of Europe and North America while failing to capture the transmission dynamics of sub-Saharan Africa or South Asia, where reporting gaps are staggering [5, 24, 42].
This problem is compounded by the extreme class imbalance that characterizes outbreak prediction. Veterinary viral outbreaks are, thankfully, rare events relative to the denominator of susceptible populations. Models trained on imbalanced datasets, where "no outbreak" records vastly outnumber "outbreak" records, tend to achieve high overall accuracy by simply predicting the majority class, which renders them clinically useless. While techniques such as the synthetic minority oversampling technique (SMOTE) and random oversampling have been employed to mitigate this [11], their application to veterinary data remains problematic because artificially generated outbreak records may not preserve the underlying epidemiological structure. For Rabies Lyssavirus surveillance in low-resource settings, oversampling strategies improved model sensitivity but could not overcome the fundamental limitation that most animal rabies investigations end without laboratory confirmation, leaving models to learn from clinically diagnosed rather than virologically confirmed cases [11].
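The accuracy trap described above can be made concrete with a small sketch. The record counts (990 "no outbreak", 10 "outbreak"), the `herd_id` field, and all function names are hypothetical and are not drawn from any cited study. The sketch shows a majority-class predictor scoring high accuracy with zero sensitivity, followed by plain random oversampling (not SMOTE), which only duplicates existing outbreak records.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of records where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred):
    """Fraction of true outbreaks (label 1) that were predicted as outbreaks."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    pos = [(r, l) for r, l in zip(rows, labels) if l == 1]
    neg = [(r, l) for r, l in zip(rows, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [l for _, l in combined]

# Hypothetical surveillance table: 990 quiet herd-weeks, 10 outbreak herd-weeks.
labels = [0] * 990 + [1] * 10
rows = [{"herd_id": i} for i in range(1000)]

# A "model" that always predicts the majority class.
always_negative = [0] * len(labels)
acc = accuracy(labels, always_negative)
sens = sensitivity(labels, always_negative)

# Rebalance the training set by duplicating outbreak records.
bal_rows, bal_labels = random_oversample(rows, labels)
```

Because oversampling only repeats the same ten outbreak records, it cannot add epidemiological information that was never collected, which is the limitation the paragraph above raises.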
Temporal and Spatial Generalizability: The Achilles' Heel of Static Models
A second, equally formidable challenge is the temporal non-stationarity of viral evolution and ecological drivers. ML models are typically trained on historical data and assume that the statistical relationships learned will persist into the future. Yet veterinary viruses defy this assumption with alarming regularity. The emergence of new Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) lineages, antigenic drift in Foot-and-Mouth Disease Virus, and the rapid global dissemination of highly pathogenic H5N1 clade 2.3.4.4b all demonstrate that viral behavior shifts over time, often unpredictably [9, 22, 24, 36]. Models that lack mechanisms for detecting and adapting to these shifts, such as online learning or periodic retraining on new data, will inevitably degrade in performance. The difficulty was illustrated by attempts to predict the clinical impact of PRRSV outbreaks in Ontario sow herds using open reading frame-5 sequences, where even sophisticated extreme gradient boosting machines achieved only modest accuracy (60.8% for abortion impact) and failed to reach the threshold for field implementation [9]. Similarly, ecological niche models for Foot-and-Mouth Disease Virus in South Asia, built on data spanning 2009-2022, demonstrated high cross-validation accuracy (>0.87) but could not guarantee predictive performance under scenarios of climate change, land-use alteration, or shifting livestock trade patterns [7, 33].
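One simple way to operationalize "periodic retraining" is to track rolling accuracy on newly confirmed cases and flag the model once performance falls below its validation baseline. The sketch below is a minimal illustration under assumed values: the `DriftMonitor` class, the 0.85 baseline, the 50-case window, and the 0.10 tolerance are all hypothetical, not parameters from the cited studies.

```python
from collections import deque

class DriftMonitor:
    """Flags a deployed model for retraining when rolling accuracy drops."""

    def __init__(self, baseline_accuracy, window=50, tolerance=0.10):
        self.baseline = baseline_accuracy   # accuracy measured at validation time
        self.tolerance = tolerance          # allowed drop before retraining
        self.recent = deque(maxlen=window)  # most recent prediction outcomes

    def record(self, predicted, observed):
        """Store whether a prediction matched the later-confirmed outcome."""
        self.recent.append(predicted == observed)

    def rolling_accuracy(self):
        return sum(self.recent) / len(self.recent) if self.recent else None

    def needs_retraining(self):
        # Wait until the window is full so a few early misses do not trigger it.
        if len(self.recent) < self.recent.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance

monitor = DriftMonitor(baseline_accuracy=0.85)

# Period 1: predictions still match confirmed outcomes.
for _ in range(50):
    monitor.record(predicted=1, observed=1)
stable_flag = monitor.needs_retraining()

# Period 2: a hypothetical new lineage circulates; 20 of 50 predictions miss.
for i in range(50):
    monitor.record(predicted=1, observed=1 if i < 30 else 0)
drift_flag = monitor.needs_retraining()
```

A monitor like this only detects degradation after confirmed outcomes arrive, so its usefulness depends on how quickly laboratory confirmation feeds back into the system.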
Spatial generalizability is equally problematic. Models fitted to one geographic region incorporate region-specific relationships between environmental drivers and outbreak risk that may not hold elsewhere. The distribution of Bluetongue Virus and African Horse Sickness Virus vectors in South Africa, modeled using random forest with 75% variance explained, was heavily dependent on local variables such as cattle density and water vapor pressure [14]. Applying these same models to regions with different vector species assemblages, livestock management practices, or climatic regimes would likely produce unreliable predictions. The fundamental assumption of ecological niche modeling, that species distributions are shaped by environmental constraints that can be generalized, often founders on the hard reality of local adaptation and historical contingency.
Model Interpretability and the Trust Deficit in Clinical Veterinary Practice
For ML models to influence real-world outbreak control decisions, such as movement restrictions, vaccination campaigns, or pre-emptive culling, they must be trusted by veterinary practitioners, producers, and regulatory authorities. However, the "black box" nature of many high-performing algorithms (deep neural networks, gradient boosting machines, ensemble methods) creates a significant trust deficit. A veterinarian asked to recommend depopulation of a swine herd based on an XGBoost score they cannot explain will, and should, hesitate. The clinician's mandate is to understand the why behind a prediction, not merely the what. This has driven growing interest in explainable AI (XAI) techniques such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), which provide feature-level attribution for individual predictions [1, 4, 7, 8, 20].
The application of SHAP to predict Avian Influenza Virus outbreaks in Chinese laying farms identified vaccination status and disinfection frequency as the most influential features, providing actionable insights for farm managers [8]. Similarly, SHAP analysis of meningitis classification models highlighted leukocyte count and petechiae presence as driving predictors, offering clinical face validity [20]. Yet these interpretability methods have limitations: SHAP values can be computationally expensive to compute, can be unstable when features are correlated, and may not fully capture the non-linear interaction effects that characterize complex epidemiological systems. For Lumpy Skin Disease Virus, where deep learning models achieved 99% accuracy on augmented datasets, the clinical utility of a prediction remains contingent on the clinician's ability to verify that the model has learned biologically plausible features (lesion morphology, vector exposure history) rather than spurious dataset artifacts [12]. The path forward requires not only algorithmic improvements but also a cultural shift in veterinary education, training practitioners to critically evaluate ML outputs rather than accept them as immutable oracles.
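To show what feature-level attribution means, the sketch below computes exact Shapley values for a toy farm-risk function by enumerating every feature ordering. It is an illustration of the underlying idea, not the `shap` library and not the model from [8]: the `toy_risk` function, its coefficients, and the feature names are hypothetical, and exhaustive enumeration is feasible only for a handful of features.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley attribution of model(instance) - model(baseline).

    Features are switched from their baseline value to the instance value
    one at a time, and each feature's marginal effect is averaged over
    every possible ordering.
    """
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = instance[f]
            value = model(current)
            totals[f] += value - previous
            previous = value
    return {f: t / len(orderings) for f, t in totals.items()}

def toy_risk(x):
    """Hypothetical outbreak-risk score with one interaction term."""
    risk = 0.5
    risk -= 0.3 * x["vaccinated"]
    risk -= 0.02 * x["disinfections_per_week"]
    # Wild-bird contact only matters for unvaccinated flocks in this toy model.
    risk += 0.1 * x["wild_bird_contact"] * (1 - x["vaccinated"])
    return risk

farm = {"vaccinated": 0, "disinfections_per_week": 1, "wild_bird_contact": 1}
reference = {"vaccinated": 1, "disinfections_per_week": 7, "wild_bird_contact": 0}

phi = shapley_values(toy_risk, farm, reference)
gap = toy_risk(farm) - toy_risk(reference)
```

The attributions sum to the gap between the farm's score and the reference farm's score, and the interaction term is split evenly between the two features involved. That even split is one reason attributions for interacting or correlated features need careful reading.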
Emerging Frontiers: Genomic Integration, Real-Time Sensing, and Autonomous Adaptation
Despite these formidable challenges, the future of veterinary viral outbreak prediction is remarkably bright, driven by three converging revolutions: the genomic data deluge, ubiquitous biosensing, and the maturation of adaptive learning algorithms. The first frontier involves the direct integration of viral genomic sequence data into predictive frameworks. Rather than treating outbreak risk as a function of static environmental or demographic variables, next-generation models can now incorporate the intrinsic evolutionary potential of the pathogen itself. The concept of using compositional biases in viral genomes (dinucleotide frequencies, codon usage patterns) to predict host range has been validated across diverse RNA viruses, with random forest classifiers achieving ~73% accuracy in predicting animal hosts from spike protein sequences alone [16, 28, 59]. This approach has been extended to identify potential reservoir hosts for filoviruses by integrating receptor-binding assays with ML, creating a pipeline that predicts susceptible bat species based on Niemann-Pick C1 sequence variation [34]. For SARS-CoV-2, deep mutational scanning data combined with socio-demographic indices enabled prediction of lineage-specific fitness via XGBoost with an R² of 0.786, demonstrating that viral traits (ACE2 binding, immune escape) and host population context jointly shape outbreak trajectories [26].
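The compositional features mentioned above are straightforward to compute. The following sketch derives dinucleotide frequencies and codon usage from a nucleotide string; the function names and the 12-base example sequence are hypothetical, and the downstream classifier (for example, a random forest) is deliberately left out.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 overlapping dinucleotides."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA notation
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set(BASES)]  # skip ambiguous bases
    counts = Counter(pairs)
    total = sum(counts.values())
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a, b in product(BASES, repeat=2)}

def codon_usage(seq):
    """Relative frequency of each in-frame codon, reading from position 0."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if set(c) <= set(BASES)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Hypothetical 12-base fragment, not a real viral sequence.
fragment = "ATGCGCGATTAA"
dinuc = dinucleotide_frequencies(fragment)
codons = codon_usage(fragment)

# A fixed-order feature vector that a host classifier could consume.
feature_vector = [dinuc[a + b] for a, b in product(BASES, repeat=2)]
```

Using a fixed ordering of the 16 dinucleotides keeps feature vectors comparable across sequences of different lengths.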
The second frontier is the revolution in real-time, continuous data streams from wearable biosensors and environmental monitoring. Traditional outbreak surveillance relies on retrospective case reporting, introducing weeks of delay between infection onset and detection. Machine learning models trained on physiological data from sensor-equipped livestock (body temperature, heart rate variability, respiratory rate, movement patterns) can detect anomalies 48-72 hours before clinical signs emerge, transforming outbreak prediction from reactive to anticipatory [3, 17]. For Infectious Salmon Anemia Virus in aquaculture, where environmental parameters (temperature, dissolved oxygen) profoundly influence disease expression, sensor-integrated models could predict impending outbreaks with sufficient lead time to modify feeding regimes or initiate harvest. The combination of CRISPR-based molecular detection with AI-driven interpretive biosensing, as proposed in "reflective diagnostics," could generate composite risk scores that integrate molecular signals with environmental context (temperature, humidity, time) and continuously update decision rules through learning loops [23]. Such platforms, applied to White Spot Syndrome Virus in shrimp ponds or Koi Herpesvirus in ornamental fish facilities, would represent a paradigm shift from batch testing to continuous risk intelligence.
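A minimal version of sensor-based anomaly detection is a rolling z-score on a single physiological signal. The sketch below flags hourly body-temperature readings that rise well above an animal's own recent baseline. The 24-hour window, the threshold of 3, and the simulated readings are illustrative assumptions, and a fielded system would combine several signals rather than one.

```python
from statistics import mean, stdev

def flag_fever_anomalies(readings, window=24, z_threshold=3.0):
    """Return indices of readings far above the preceding window's mean."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            continue  # a flat baseline gives no scale for a z-score
        z = (readings[i] - mean(history)) / sigma
        if z > z_threshold:  # one-sided: only rises are of interest here
            flagged.append(i)
    return flagged

# Simulated hourly body temperature (deg C): 48 normal hours, then a rise.
normal = [38.5 if hour % 2 == 0 else 38.7 for hour in range(48)]
readings = normal + [39.8, 40.1]

flagged = flag_fever_anomalies(readings)
```

Comparing each animal against its own history, rather than a herd-wide cutoff, is the design choice that lets small individual deviations surface early.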
The third frontier involves moving beyond static classification to autonomous, adaptive prediction systems capable of learning from emerging data without catastrophic forgetting. Reinforcement learning frameworks, which optimize decision-making through trial-and-error interaction with an environment, offer a tantalizing avenue for dynamically adjusting biosecurity interventions in response to evolving outbreak risk [57]. For Porcine Reproductive and Respiratory Syndrome Virus, where biosecurity practices for control differ substantially from those required for eradication [4], an adaptive ML system could learn, in real time, which interventions (semen management, replacement gilt quarantine, barn climate control) are most effective given the current circulating strain, herd immunity profile, and environmental conditions. Similarly, spatial models of vector-borne diseases could incorporate dynamic vector surveillance data (e.g., from citizen science programs or automated light traps) and update risk maps weekly, enabling targeted vector control in regions where transmission risk is escalating [50-52].
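The trial-and-error idea can be illustrated with an epsilon-greedy multi-armed bandit, the simplest reinforcement-learning setting, which has no state and a single decision per round. This is a simulation sketch only: the success probabilities are invented placeholders rather than measured intervention effects, and a real system would have to condition on strain, herd immunity, and environment.

```python
import random

def epsilon_greedy(success_prob, rounds=2000, epsilon=0.1, seed=0):
    """Learn which option succeeds most often from simulated feedback.

    success_prob maps each option to a hidden probability of a good outcome;
    the learner sees only the 0/1 reward after each choice.
    """
    rng = random.Random(seed)
    arms = list(success_prob)
    pulls = {a: 0 for a in arms}
    estimate = {a: 0.0 for a in arms}
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(arms)                        # explore
        else:
            arm = max(arms, key=lambda a: estimate[a])    # exploit best estimate
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimate[arm] += (reward - estimate[arm]) / pulls[arm]
    return pulls, estimate

# Invented probabilities, for simulation only.
simulated = {
    "semen_management": 0.55,
    "gilt_quarantine": 0.70,
    "barn_climate_control": 0.40,
}
pulls, estimate = epsilon_greedy(simulated)
```

The exploration rate controls how often the learner tries options it currently rates lower, which matters when the best intervention changes as new strains emerge.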
The One Health Imperative and the Path to Equitable Deployment
Ultimately, the greatest challenge, and the most urgent future direction, is ensuring that ML-based outbreak prediction serves the global veterinary community equitably. The current landscape is characterized by profound disparities in data infrastructure, computational capacity, and analytical expertise. High-income countries with centralized diagnostic networks, robust electronic recording systems, and access to bioinformaticians are positioned to reap the benefits of predictive modeling. Low- and middle-income countries, where the burden of zoonotic emerging infectious diseases is often highest and surveillance gaps most critical, face a perfect storm of data poverty, limited internet connectivity, and insufficient skilled personnel [3, 5, 11, 53, 58, 60]. International organizations, including the World Organisation for Animal Health (WOAH), the Food and Agriculture Organization (FAO), and the World Health Organization (WHO), have recognized this challenge, but concrete progress toward open-source, low-bandwidth, locally adaptable ML tools remains insufficient.
The future of veterinary viral outbreak prediction must therefore be built on principles of open science, modularity, and capacity building. Federated learning approaches, where models are trained across distributed datasets without centralizing sensitive data, could enable collaborative model development across national boundaries while respecting data sovereignty. Pre-trained, transferable models that can be fine-tuned on small local datasets (for instance, adapting a base model for Foot-and-Mouth Disease Virus risk in South Asia to the specific ecology of West Africa [33]) would dramatically reduce the data requirements for low-resource settings. Investments in field-deployable diagnostic technologies (e.g., portable qPCR, CRISPR-based sensors) that generate standardized, machine-readable data must proceed in lockstep with ML algorithm development, ensuring that data generation and data analysis advance together rather than in isolation [54, 56].
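The core of federated learning is that only model parameters, never raw records, leave each site. The sketch below is a minimal federated averaging loop for a one-variable linear model, offered as an assumption-laden illustration: the two "laboratories", their data points, the learning rate, and the function names are all hypothetical, and real deployments add secure aggregation and far richer models.

```python
def local_step(weights, data, lr=0.05):
    """One full-batch gradient step on mean squared error, run at a single site."""
    w, b = weights
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return [w - lr * grad_w, b - lr * grad_b]

def federated_average(client_updates):
    """Average parameter vectors, weighting each site by its sample count."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [sum(w[i] * n for w, n in client_updates) / total for i in range(dim)]

def train(sites, rounds=1000):
    """Coordinator loop: broadcast weights, collect local updates, average."""
    weights = [0.0, 0.0]
    for _ in range(rounds):
        updates = [(local_step(weights, data), len(data)) for data in sites]
        weights = federated_average(updates)
    return weights

# Two hypothetical laboratories whose private data follow y = 2x + 1.
lab_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
lab_b = [(3.0, 7.0), (4.0, 9.0)]

slope, intercept = train([lab_a, lab_b])
merged = federated_average([([1.0, 2.0], 100), ([3.0, 4.0], 300)])
```

Weighting by sample count keeps a small laboratory from being drowned out or over-represented relative to its data, although it does not by itself correct for differences in data quality between sites.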
The challenges are daunting: data heterogeneity, temporal non-stationarity, interpretability deficits, infrastructural inequity. Yet the convergence of genomic virology, sensor technology, and adaptive machine learning offers an unprecedented opportunity to transform veterinary outbreak prediction from a retrospective accounting of losses into a prospective, actionable science. As a pathologist who has spent decades watching viruses outrun our diagnostic capacity, I am convinced that the models we build today, if built rigorously, transparently, and equitably, will determine whether we spend the next pandemic cycle chasing pathogens or anticipating them.
References
1. Li C, Wang X, Liu H, Zhu D. Applications and advances of deep learning in food microbiology: From data to intelligent analysis. Trends in Food Science & Technology. 2026. DOI: https://doi.org/10.1016/j.tifs.2026.105687
2. Mi W, Kong Y, Ma Y, Gong C, Huo Y, Feng X, et al. Machine learning-based modeling for monitoring and predicting the detection rate and severity of pathogenic microorganisms in seafood. International Journal of Food Microbiology. 2026. DOI: https://doi.org/10.1016/j.ijfoodmicro.2026.111732
3. Ehsanullah, Maqbool B, Arshad M, Abourashed N, Gul S. Role of artificial intelligence in veterinary anatomical diagnostics and zoonotic disease monitoring. Annals of Anatomy - Anatomischer Anzeiger. 2026. DOI: https://doi.org/10.1016/j.aanat.2025.152756
4. Akiyama S, Sasaki Y. Biosecurity practices useful for porcine reproductive and respiratory syndrome control and eradication on commercial swine farms using machine learning models. Preventive Veterinary Medicine. 2026. DOI: https://doi.org/10.1016/j.prevetmed.2025.106764
5. Li Y, Qiao Y, Zhan Y, Dong J, Toor MVv, Waldenström J, et al. Mapping global avian influenza risk patterns through waterbird activity entropy. Nature Communications. 2026. DOI: https://doi.org/10.1038/s41467-026-70432-0
6. Lagergren JH, Ruiz‐Aravena M, Becker DJ, Madden W, Ruytenberg L, Hoegh A, et al. Environmental and ecological signals predict food shortages for subtropical populations of Australian flying foxes, reservoirs of Hendra virus. Biology Letters. 2025. DOI: https://doi.org/10.1098/rsbl.2025.0371
7. Alkhamis MA, Abouelhassan H, Alateeqi A, Husain A, Humphreys JM, Arzt J, et al. Predicting the Landscape Epidemiology of Foot-and-Mouth Disease in Endemic Regions: An Interpretable Machine Learning Approach. Viruses. 2025. DOI: https://doi.org/10.3390/v17101383
8. Zhuang Y, Liu Y, Li Q, Zhang H, Wang C, Ding L, et al. An intelligence framework for the early warning of avian influenza in Chinese laying farms. Poultry Science. 2025. DOI: https://doi.org/10.1016/j.psj.2025.105822
9. Melmer D, O’Sullivan T, Greer A, Ojkić D, Friendship R, Poljak Z. Machine learning models provide modest accuracy in predicting clinical impact of porcine reproductive and respiratory syndrome type 2 in Canadian sow herds. American Journal of Veterinary Research. 2025. DOI: https://doi.org/10.2460/ajvr.24.10.0289
10. Megahed A, Bommineni R, Short M, Galvão K, Bittar J. Using supervised machine learning algorithms to predict bovine leukemia virus seropositivity in dairy cattle in Florida: A 10-year retrospective study. Preventive Veterinary Medicine. 2024. DOI: https://doi.org/10.1016/j.prevetmed.2024.106387
11. Keshavamurthy R, Boutelle C, Nakazawa Y, Joseph HC, Joseph DW, Dilius P, et al. Machine learning to improve the understanding of rabies epidemiology in low surveillance settings. Scientific Reports. 2024. DOI: https://doi.org/10.1038/s41598-024-76089-3
12. Mujahid M, Khurshaid T, Safran MS, Alfarhood S, Ashraf I. Prediction of lumpy skin disease virus using customized CBAM-DenseNet-attention model. BMC Infectious Diseases. 2024. DOI: https://doi.org/10.1186/s12879-024-10032-9
13. Alberts F, Berke O, Maboni G, Petukhova T, Poljak Z. Utilizing machine learning and hemagglutinin sequences to identify likely hosts of influenza H3Nx viruses. Preventive Veterinary Medicine. 2024. DOI: https://doi.org/10.1016/j.prevetmed.2024.106351
14. Klerk Jd, Tildesley M, Labuschagne K, Gorsich EE. Modelling bluetongue and African horse sickness vector (Culicoides spp.) distribution in the Western Cape in South Africa using random forest machine learning. Parasites & Vectors. 2024. DOI: https://doi.org/10.1186/s13071-024-06446-8
15. Ru B, Kujawski SA, Afanador NL, Baumgartner R, Pawaskar M, Das AK. Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches. JMIR Formative Research. 2022. DOI: https://doi.org/10.2196/42832
16. Brierley L, Fowler A. Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning. bioRxiv. 2020. DOI: https://doi.org/10.1371/journal.ppat.1009149
17. Khan MTR, Arunkumar A, Prasanna KL, M.Devaki, M V, A A. Machine Learning for the Early Diagnosis of Livestock Diseases via Wearable Biosensors and Environmental Monitoring. 2025 1st International Conference on Advancement in Futuristic Technologies (ICAFT). 2025. DOI: https://doi.org/10.1109/ICAFT66710.2025.11453220
18. Chadha A, Dara R, Pearl D, Gillis D, Rosendal T, Poljak Z. Classification of porcine reproductive and respiratory syndrome clinical impact in Ontario sow herds using machine learning approaches. Frontiers in Veterinary Science. 2023. DOI: https://doi.org/10.3389/fvets.2023.1175569
19. Mehta A, Patel KS. LSTM-based Forecasting of Dengue Cases in Gujarat: A Machine Learning Approach. Indian Journal of Science and Technology. 2024. DOI: https://doi.org/10.17485/ijst/v17i7.2748
20. Victor A, Santos DAM, Nery EK, Mori DP, Lucas PCdC, Cammarota D, et al. Improving meningitis surveillance and diagnosis with machine learning: Insights from São Paulo. PLOS Digital Health. 2025. DOI: https://doi.org/10.1371/journal.pdig.0000925
21. Purwono P, Nabila A, Wulandari E, Hardeani N, Purwono. Virus Host Prediction with Metagenomic Features using Support Vector Machine Algorithm and Grid Search Cross Validation Optimization. Journal of Advanced Health Informatics Research. 2024. DOI: https://doi.org/10.59247/jahir.v2i3.298
22. Hoffer Ad, Vatani S, Cot C, Cacciapaglia G, Chiusano M, Cimarelli A, et al. Variant-driven early warning via unsupervised machine learning analysis of spike protein mutations for COVID-19. Scientific Reports. 2022. DOI: https://doi.org/10.1038/s41598-022-12442-8
23. Liberty J. Reflective diagnostics: A self-learning CRISPR-biosensor-AI platform for adaptive food and health safety monitoring. Food and Humanity. 2026. DOI: https://doi.org/10.1016/j.foohum.2026.101005
24. Hayes S, Hilton J, Mould-Quevedo J, Donnelly C, Baylis M, Brierley L. Ecology and environment predict spatially stratified risk of H5 highly pathogenic avian influenza clade 2.3.4.4b in wild birds across Europe. bioRxiv. 2025. DOI: https://doi.org/10.1038/s41598-025-30651-9
25. Frank J, Gan E, Hooper WB, Ott I, Iwasaki A. Systematic multi-reference vertebrate ACE2 sequence similarity analysis predicts species susceptibility to SARS-related sarbecoviruses. Scientific Reports. 2026. DOI: https://doi.org/10.1038/s41598-026-41410-9
26. Ding Z, Yuan H. Viral traits from deep mutational scanning and socio-demographic context predict SARS-CoV-2 lineage fitness across diverse countries. International Journal of Infectious Diseases. 2025. DOI: https://doi.org/10.1016/j.ijid.2025.108260
27. Ballesteros-Ricaurte J, Fabregat R, Carrillo-Ramos A, Parra C, Moreno A. Artificial neural networks to predict the presence of Neosporosis in cattle. Mathematical biosciences and engineering : MBE. 2025. DOI: https://doi.org/10.3934/mbe.2025041
28. Babayan S, Orton R, Streicker D. Predicting Reservoir Hosts and Arthropod Vectors from Evolutionary Signatures in RNA Virus Genomes. Science. 2018. DOI: https://doi.org/10.1126/science.aap9072
29. Megahed A, Bommineni Y, Short M, Bittar J. Using Supervised Machine Learning Algorithms to Predict Bovine Leukemia Virus Seropositivity in Florida Beef Cattle: A 10‐Year Retrospective Study. Journal of Veterinary Internal Medicine. 2025. DOI: https://doi.org/10.1111/jvim.70070
30. Yang Y, Cui J, Kumar A, Luo D, Murray J, Jones L, et al. Multiplex Detection and Quantification of Virus Co-Infections Using Label-free Surface-Enhanced Raman Spectroscopy and Deep Learning Algorithms. ACS Sensors. 2025. DOI: https://doi.org/10.1021/acssensors.4c03209
31. Yang Y, Cui J, Luo D, Murray J, Chen X, Hülck S, et al. Rapid Detection of SARS-CoV-2 Variants Using an Angiotensin-Converting Enzyme 2-Based Surface-Enhanced Raman Spectroscopy Sensor Enhanced by CoVari Deep Learning Algorithms. ACS Sensors. 2024. DOI: https://doi.org/10.1021/acssensors.4c00488
32. Jlassi O, Hadid A, McDonald EG, Ding Q, Phillipp C, Trottier A, et al. Predicting viral respiratory tract infections using wearable biosensor monitoring during 3-minute constant rate stair stepping tests. Frontiers in Sensors. 2026. DOI: https://doi.org/10.3389/fsens.2026.1736869
33. Gunasekera U, Alkhamis MA, Puvanendiran S, Das M, Kumarawadu P, Sultana M, et al. Ecological niche modeling for surveillance of foot-and-mouth disease in South Asia. PLoS ONE. 2025. DOI: https://doi.org/10.1371/journal.pone.0320921
34. Lasso G, Grodus MG, Valencia E, DeJesus VA, Liang E, Delwel I, et al. Decoding the blueprint of receptor binding by filoviruses through large-scale binding assays and machine learning. Cell Host and Microbe. 2025. DOI: https://doi.org/10.1016/j.chom.2024.12.016
35. Xu Y, Wojtczak D. Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences. Biosyst. 2022. DOI: https://doi.org/10.48550/arXiv.2207.13842
36. Hoffer Ad, Vatani S, Cot C, Cacciapaglia G, Chiusano M, Cimarelli A, et al. Predicting Variant-Driven Multi-Wave Pattern of COVID-19 via a Machine Learning Analysis of Spike Protein Mutations. 2021. DOI: https://doi.org/10.21203/rs.3.rs-1110029/v1
37. Hassen B. Antibiotic resistance in aquaculture: Contributions and perspectives of genomics. Aquaculture and Fisheries. 2026. DOI: https://doi.org/10.1016/j.aaf.2026.01.007
38. Pan J, Liu Z, Dai Y, Qin W, Tan Y, Huang L, et al. Experimental investigation of the dynamic grass carp reovirus transmission by an epidemic mathematical model. Aquaculture. 2026. DOI: https://doi.org/10.1016/j.aquaculture.2025.743575
39. Izhari M, Gosady ARA, Alghamdi F, Alghamdi WA, Hadadi MAA, Almontasheri AHA, et al. Rational design and in silico characterization of a multiepitope mRNA vaccine candidate against human metapneumovirus (hMPV) using reverse vaccinology and immunoinformatics approaches. Scientific Reports. 2025. DOI: https://doi.org/10.1038/s41598-025-29906-2
40. Zanon IP, Amarante VS, Kenney SM, Silva T, Castro YGd, Campos JVF, et al. Salmonella Dublin outbreaks in Brazilian cattle: clinical-epidemiological aspects, antimicrobial resistance, and comparative genomic analysis. Microbiology spectrum. 2026. DOI: https://doi.org/10.1128/spectrum.02665-25
41. Fang R, Wang L, Yang X, Guo Y, Peng B, Zhang Y, et al. Predictive Modeling of Global SARS-CoV-2 Infection Risk in Animals: Unveiling Potential Reservoirs and Informing Health Policy Synergies. Transboundary and Emerging Diseases. 2025. DOI: https://doi.org/10.1155/tbed/3959370
42. Im C, Curtis A, Song D, Kim OS. Mapping African Swine Fever and Highly Pathogenic Avian Influenza Outbreaks along the Demilitarized Zone in the Korean Peninsula. Transboundary and Emerging Diseases. 2024. DOI: https://doi.org/10.1155/2024/8824971
43. Garcia-Vozmediano A, Maurella C, Ceballos L, Crescio E, Meo R, Martelli W, et al. Machine learning approach as an early warning system to prevent foodborne Salmonella outbreaks in northwestern Italy. Veterinary Research. 2024. DOI: https://doi.org/10.1186/s13567-024-01323-9
44. Bhunia S, Abirami D. Correlation based Feature Selection and Hybrid Machine Learning Approach for Forecasting Disease Outbreaks. International Conference Electronic Systems, Signal Processing and Computing Technologies [ICESC-]. 2023. DOI: https://doi.org/10.1109/ICESC57686.2023.10193045
45. Qaid TS, Mazaar H, Al-Shamri MYH, Alqahtani M, Raweh AA, Alakwaa W. Hybrid Deep-Learning and Machine-Learning Models for Predicting COVID-19. Computational Intelligence and Neuroscience. 2021. DOI: https://doi.org/10.1155/2021/9996737
46. Zhang Y, Li Q, Duan H, Tan L, Cao Y, Chen J. Machine learning based predictive modeling and risk factors for prolonged SARS-CoV-2 shedding. Journal of Translational Medicine. 2024. DOI: https://doi.org/10.1186/s12967-024-05872-7
47. Mazhar B, Ali N, Manzoor F, Khan MK, Nasir M, Ramzan M. Development of Data-driven Machine Learning Models and their Potential Role in Predicting Dengue outbreak. Journal of Vector Borne Diseases. 2024. DOI: https://doi.org/10.4103/0972-9062.393976
48. Nunayon S, Glover K, Xu M, Zhong L. Ultraviolet germicidal irradiation and ventilation for avian influenza control in poultry farms: A comprehensive review. Journal of Hazardous Materials. 2026. DOI: https://doi.org/10.1016/j.jhazmat.2026.141454
49. Er A, Gülten E, Temoçin F, Aktuğ-Demir N. Artificial Intelligence in Infectious Diseases and Clinical Microbiology: Current and Future Directions. Klimik Dergisi/Klimik Journal. 2026. DOI: https://doi.org/10.36519/kd.2026.5400
50. Cholvi M, Moretti R, Osório H, L'Ambert G, Seixas G, Kavran M, et al. Present and Future of Mosquito-Borne Disease Control in Europe with a Specific Focus on the Mediterranean. Insects. 2026. DOI: https://doi.org/10.3390/insects17030254
51. Yang C, Futami K, Nihei N, Fujita R, Ogino K, Hirabayashi K, et al. Tiger prowling: Distribution modelling for northward-expanding Aedes albopictus (Diptera: Culicidae) in Japan. PLoS ONE. 2024. DOI: https://doi.org/10.1371/journal.pone.0303137
52. Gottipati A, Iragavarapu S. Forecasting the Spread of Dengue Outbreaks with a Synthesis of Machine Learning Models Utilizing Exogenous Variables. Industrial and Systems Engineering Review. 2025. DOI: https://doi.org/10.37266/iser.2025v12i1.pp13-28
53. R.O O, A.R U, U.E I. AI- Driven Prediction of Lassa Fever Using Evolutional and Random Forest: A Machine Learning Approach for Enhanced Surveillance in West Africa. International Journal of Data Science. 2025. DOI: https://doi.org/10.18517/ijods.6.1.40-53.2025
54. Metsky HC, Freije CA, Kosoko-Thoroddsen TF, Sabeti PC, Myhrvold C. CRISPR-based surveillance for COVID-19 using genomically-comprehensive machine learning design. bioRxiv. 2020. DOI: https://doi.org/10.1101/2020.02.26.967026
55. V PH, Prakash DHN. The Prospective Of Artificial Intelligence In Veterinary Care. INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT. 2025. DOI: https://doi.org/10.55041/ijsrem51805
56. Hussain A, Abbas Q, Nadeem M, Nazar A, Athar A, Rahman H. Meat-Borne Bacterial Pathogen Detection: Conventional, Molecular and Emerging AI-Based Strategies. Diagnostics. 2026. DOI: https://doi.org/10.3390/diagnostics16091360
57. Lu Y, Li X, El‐Aty AMA, Ju X, Yong Y. Host-Microbe Interactions: Prospects of Machine Learning and Deep Learning Technologies in Animal Viral Disease Management. Veterinary Sciences. 2025. DOI: https://doi.org/10.3390/vetsci12121129
58. Shafi M, Shabir S, Jan S, Wani ZA, Rather M, Beigh Y, et al. The role of artificial intelligence in detecting avian influenza virus outbreaks: A review. Open Veterinary Journal. 2025. DOI: https://doi.org/10.5455/OVJ.2025.v15.i5.4
59. Alberts F, Berke O, Rocha L, Keay S, Maboni G, Poljak Z. Predicting host species susceptibility to influenza viruses and coronaviruses using genome data and machine learning: a scoping review. Frontiers in Veterinary Science. 2024. DOI: https://doi.org/10.3389/fvets.2024.1358028
60. Olawade D, Ezeagu CN, Alisi CS, David-Olawade AC, Eniola DM, Akingbala T, et al. AI-Driven Strategies for Enhancing Mpox Surveillance and Response in Africa: A Review. Journal of Virological Methods. 2025. DOI: https://doi.org/10.1016/j.jviromet.2025.115270