Section: Foundations & History

---
title: "Biological Foundation Models in Veterinary Virology: From ESMFold to Genomic Surveillance"
category: "structural-biology"
metaDescription: "A technical review of protein language models and genomic foundation models applied to veterinary virology, covering structure prediction, mutation analysis, and surveillance of livestock and zoonotic viruses with cross-host comparisons."
primaryKeyword: "biological foundation models veterinary virology"
secondaryKeywords: ["ESMFold", "protein language model", "genomic surveillance", "veterinary virology", "viral protein structure prediction", "ESM-2"]
---

Biological Foundation Models in Veterinary Virology: From ESMFold to Genomic Surveillance

Introduction

The application of deep learning to biological sequence data has produced foundation models that learn distributed representations of proteins and nucleic acids from large unlabeled corpora. In veterinary virology, these models can be used to predict viral protein structures, characterize mutation effects, and support genomic surveillance of livestock, poultry, and zoonotic pathogens. The global burden of enteric infectious diseases, as systematically cataloged in large epidemiological studies [1], and the corresponding toll of enteric and respiratory viral diseases in production animals underscore the need for rapid structural and functional annotation of emerging viral variants. Protein language models such as ESM-2, the ESMFold structure predictor built on it, and genomic foundation models trained on metagenomic assemblies now provide a computational infrastructure to support these efforts.

This review examines the biological and algorithmic foundations of these models, their specific applications to veterinary viruses, and the integration of predicted structural data with field surveillance pipelines. Emphasis is placed on the biophysical interpretation of model outputs and the validation of predicted interactions using traditional virological methods.

Protein Language Models in Veterinary Virology

Protein language models (PLMs) employ transformer architectures trained on masked language modeling objectives using millions of protein sequences from diverse taxa. The Evolutionary Scale Modeling (ESM) family, particularly ESM-2, learns contextual embeddings that capture evolutionary constraints, secondary structure propensities, and residue-residue contact patterns. For viral proteins, these embeddings encode information about host-specific adaptations, immune evasion surfaces, and conformational flexibility in envelope or capsid proteins.
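One common way to turn a masked language modeling objective into a mutation-effect estimate is to compare the probability the model assigns to a mutant residue with the probability it assigns to the wild-type residue at the same masked position. The sketch below illustrates only that arithmetic. The function name `masked_marginal_score` and the probability table `toy_probs` are hypothetical and are not ESM-2 outputs; in practice the probabilities would come from a trained PLM.

```python
import math


def masked_marginal_score(position_probs, wild_type, mutant):
    """Log-odds of a mutant residue versus the wild-type residue at one masked position.

    position_probs maps amino-acid letters to model probabilities at the
    masked position (hypothetical values here, not real ESM-2 output).
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Hypothetical probabilities for a single masked position.
toy_probs = {"A": 0.60, "V": 0.30, "D": 0.10}

# A negative score means the model finds the mutant less likely than wild type.
score_conservative = masked_marginal_score(toy_probs, "A", "V")
score_disruptive = masked_marginal_score(toy_probs, "A", "D")
print(round(score_conservative, 3), round(score_disruptive, 3))  # -0.693 -1.792
```

Under these toy values the A-to-D substitution scores lower than A-to-V, which is the kind of ranking that would then be checked against experimental fitness or neutralization data.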

The applicability of PLMs to veterinary viruses has been demonstrated through multiple case studies. For instance, the identification of a conserved linear B-cell epitope in the capsid protein of porcine circovirus type 4 (PCV4) relied on structural predictions to guide peptide mapping [2]. ESMFold-generated models of the PCV4 capsid enabled the rational selection of epitope candidates that were subsequently validated by peptide ELISA. Such structure-guided epitope discovery is critical for developing serological assays in emerging swine pathogens where commercial reagents are unavailable.

Similarly, the structural modeling of classical swine fever virus (CSFV) E2 glycoprotein, a target for subunit vaccine development, has been enhanced by PLM-based predictions. Recombinant expression of E2 in Bacillus subtilis for oral vaccination requires precise knowledge of conformational epitopes that interact with neutralizing antibodies [6]. ESMFold predictions of the E2 ectodomain, compared with experimentally determined crystal structures from related pestiviruses, help identify solvent-exposed loops that can tolerate insertions or tags without disrupting immunogenicity.

ESMFold for Viral Protein Structure Prediction

ESMFold is an end-to-end structure prediction method that maps an amino acid sequence directly to three-dimensional coordinates without a multiple sequence alignment stage. This characteristic is especially valuable for viral proteins with few homologs or high sequence diversity, such as those from rapidly mutating RNA viruses. The model passes ESM-2 representations through a folding trunk that updates per-residue and pairwise features, followed by a structure module that outputs atomic coordinates together with per-residue confidence scores (pLDDT).

For veterinary coronaviruses, including feline coronavirus (FCoV) and porcine epidemic diarrhea virus (PEDV), ESMFold has been applied to model the spike glycoprotein receptor-binding domain (RBD) and the spike S2 fusion machinery. The PEDV spike protein, targeted by mRNA vaccines designed as virus-like particles [9], requires accurate structural templates to ensure correct folding of assembled VLPs. ESMFold predictions of the PEDV spike ectodomain, although lower in confidence for flexible hinge regions, provide starting models for molecular dynamics simulations that assess stability under vaccine formulation conditions.
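Before a predicted model is passed to molecular dynamics, it is useful to locate low-confidence stretches such as flexible hinges. The sketch below assumes the prediction was saved as a PDB file with per-residue pLDDT written to the B-factor column on a 0 to 100 scale, and uses a cutoff of 70, which is a common convention rather than a value from this review. The toy records are synthetic, and the helper names (`make_atom_line`, `per_residue_plddt`, `low_confidence_segments`) are illustrative.

```python
def make_atom_line(serial, res_seq, plddt):
    """Format a minimal fixed-column PDB ATOM record for a CA atom (synthetic demo data)."""
    return ("ATOM  {:5d} {:<4s}{:1s}{:3s} {:1s}{:4d}{:1s}   "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
        serial, " CA", "", "ALA", "A", res_seq, "", 0.0, 0.0, 0.0, 1.0, plddt)


def per_residue_plddt(pdb_lines):
    """Return {residue number: pLDDT} read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_segments(scores, threshold=70.0):
    """Group consecutive residues whose pLDDT falls below the threshold."""
    segments, current = [], []
    for res in sorted(scores):
        is_low = scores[res] < threshold
        if is_low and (not current or res == current[-1] + 1):
            current.append(res)
        else:
            if current:
                segments.append((current[0], current[-1]))
            current = [res] if is_low else []
    if current:
        segments.append((current[0], current[-1]))
    return segments


# Synthetic pLDDT values for residues 1-8 (assumed 0-100 scale).
toy_values = [91.0, 88.0, 55.0, 48.0, 62.0, 85.0, 90.0, 66.0]
toy_pdb = [make_atom_line(i, i, v) for i, v in enumerate(toy_values, start=1)]

print(low_confidence_segments(per_residue_plddt(toy_pdb)))  # [(3, 5), (8, 8)]
```

Segments flagged this way are candidates for restraint, truncation, or closer inspection during simulation setup; some tools report pLDDT on a 0 to 1 scale, in which case the threshold should be rescaled.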

The prediction of protein-protein interfaces is another domain where ESMFold contributes. For example, the heterodimerization of grass carp Toll-like receptors TLR20.2 and TLR20.3, which mediates synergistic antiviral responses against reovirus [12], can be modeled using co-evolution features encoded in ESM-2 embeddings. Contact maps predicted by ESMFold for TLR20.2 and TLR20.3 guide mutagenesis experiments that identify interface residues critical for dsRNA recognition.
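Once a model of a two-chain complex is available, candidate interface residues for mutagenesis can be listed with a simple distance criterion. The sketch below assumes alpha-carbon coordinates have already been extracted per chain and uses an 8 Å cutoff, a common contact-map convention. The coordinates and residue numbers are invented for illustration and do not describe TLR20.2 or TLR20.3.

```python
import math


def interface_contacts(chain_a, chain_b, cutoff=8.0):
    """Return sorted residue pairs (chain A, chain B) whose CA atoms lie within cutoff angstroms."""
    contacts = []
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts.append((res_a, res_b))
    return sorted(contacts)


# Hypothetical CA coordinates (angstroms), keyed by residue number.
toy_chain_a = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 12: (7.6, 0.0, 0.0)}
toy_chain_b = {45: (0.0, 6.0, 0.0), 46: (3.8, 12.0, 0.0)}

print(interface_contacts(toy_chain_a, toy_chain_b))  # [(10, 45), (11, 45)]
```

Because ESMFold is trained on single chains, an interface listed this way is best treated as a hypothesis to be tested by mutagenesis rather than as a measured contact.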

A summary of ESMFold features compared to other prediction methods is provided in Table 1.

| Feature | ESMFold | AlphaFold2 | RoseTTAFold |
| --- | --- | --- | --- |
| Sequence input | Single sequence | MSA required | MSA required |
| Embedding model | ESM-2 transformer | Evoformer + structure module | Three-track network |
| Inference speed | Approximately 10 seconds per 500 residues | Minutes to hours | Minutes |
| Best suited for | Orphan sequences, viral quasispecies | High-confidence templates, conserved families | Medium-throughput screening |
| Memory footprint (GPU) | Moderate | High | Moderate |

Genomic Foundation Models for Surveillance

Genomic foundation models extend the transformer paradigm to nucleotide sequences, learning representations from large metagenomic and viral genome repositories. Unlike traditional phylogenetics, these models encode contextual relationships between genomic regions across divergent taxa, facilitating the detection of recombination events, host range determinants, and adaptive mutations in real-time surveillance data.

In swine virology, genomic foundation models have been applied to analyze the diversity of PEDV subtypes, particularly the emerging G2c variant. A TaqMan-based RT-qPCR assay was developed to discriminate G2c from other subtypes [14]; the design of subtype-specific probes relied on embedding-based clustering of spike gene sequences. Genomic foundation models trained on alphacoronavirus genomes can project new PEDV sequences into a latent space where variants with similar receptor binding properties cluster together, enabling early warning of host jumps.
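The idea of projecting a new sequence into a latent space and reading off its nearest cluster can be sketched as nearest-centroid assignment under cosine similarity. This is a minimal illustration, not the method of the cited assay study. The three-dimensional vectors stand in for model embeddings and are invented, as are the function names and the reference labels other than G2c.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign_subtype(query, reference_embeddings):
    """Return the label whose centroid is most similar to the query embedding."""
    centroids = {label: centroid(vecs) for label, vecs in reference_embeddings.items()}
    return max(centroids, key=lambda label: cosine_similarity(query, centroids[label]))


# Hypothetical spike-gene embeddings grouped by subtype label.
toy_references = {
    "G1": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "G2c": [[0.1, 1.0, 0.2], [0.0, 0.9, 0.3]],
}
toy_query = [0.2, 0.8, 0.25]

print(assign_subtype(toy_query, toy_references))  # G2c
```

Real embeddings have hundreds to thousands of dimensions, and a query that is dissimilar to every centroid is arguably the more interesting surveillance signal, so a similarity floor would normally accompany the assignment.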

The surveillance of zoonotic viruses in bat reservoirs benefits from genomic foundation models that integrate host transcriptomic data. Bat organoid models have revealed species-specific immune pathways that restrict viral replication [4]; aligning viral genome embeddings with host gene expression embeddings from such organoids can identify viral proteins that antagonize bat interferon responses. This approach complements traditional phylogenetic methods by capturing non-homologous functional convergence.

An essential use case is the detection of co-infections and synergistic pathogenesis. Chicken infectious anemia virus (CIAV) markedly enhances the pathogenicity of infectious bronchitis virus (IBV) in chickens [10]. Genomic foundation models trained on co-occurrence patterns in metatranscriptomic data can predict pairs of viruses likely to exert synergistic effects based on their embedding distances in a joint latent space. These predictions then guide targeted surveillance in flocks presenting with atypical mortality.
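Embedding distances from a trained model are not reproduced here. As a much simpler stand-in for the co-occurrence signal such a model would learn from, the sketch below computes pointwise mutual information (PMI) for virus pairs across samples. The sample sets are invented, and "NDV" is included only as a hypothetical third label; a positive PMI means a pair is detected together more often than independent occurrence would predict.

```python
import math
from itertools import combinations


def pmi_scores(samples):
    """Pointwise mutual information (log2) for every virus pair seen together at least once.

    samples is a list of sets, each holding the viruses detected in one sample.
    """
    n = len(samples)
    single, pair = {}, {}
    for detected in samples:
        for virus in detected:
            single[virus] = single.get(virus, 0) + 1
        for a, b in combinations(sorted(detected), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Hypothetical detections in eight flock samples.
toy_samples = [
    {"CIAV", "IBV"}, {"CIAV", "IBV"}, {"IBV"}, {"NDV"},
    {"CIAV", "IBV", "NDV"}, {"NDV"}, set(), set(),
]

scores = pmi_scores(toy_samples)
top_pair = max(scores, key=scores.get)
print(top_pair, round(scores[top_pair], 2))  # ('CIAV', 'IBV') 1.0
```

Co-occurrence alone does not establish synergy, so pairs ranked highly by a score like this remain candidates for challenge studies of the kind reported in [10].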

Integration with Host Response and Antiviral Mechanisms

Foundation model predictions must be interpreted in the context of host biology. The activation of type I interferon signaling, for example through the JAK-STAT pathway, is a central antiviral mechanism in mammals. Ursodeoxycholic acid (UDCA) inhibits feline infectious peritonitis virus (FIPV) infection by activating this pathway [5]. ESMFold models of the FIPV 3C-like protease and its interaction with host STAT1 or STAT2 can inform how UDCA-mediated signaling changes may alter viral polyprotein processing. Similarly, recombinant porcine interferon alpha exhibits anti-PRRSV activity [7], and structural models of PRRSV nonstructural proteins can be screened against canine or feline interferon sequences to predict cross-species efficacy.

In fish virology, the grass carp reovirus (GCRV) immersion challenge model has elucidated innate immune responses involving TLR20 heterodimers [11, 12]. Foundation model embeddings of GCRV dsRNA segments, combined with predicted TLR20 structures, support the design of immunostimulatory RNA motifs that could be incorporated into feed additives.

Spatial and Multi-Omics Considerations

The transition from sequence-level predictions to tissue-level understanding requires integration with spatial transcriptomics. A murine model of neurotoxocariasis [13] demonstrated region-specific host responses in the brain; analogous approaches in veterinary virology, such as examining canine distemper virus neuroinvasion, can benefit from foundation models that predict viral protein interactions with blood-brain barrier transporters. Although neurotoxocariasis is parasitic, the methodological framework for correlating transcriptomic clusters with pathogen distribution applies directly to viral encephalitides.

Workflow Diagram

The following Mermaid diagram outlines an integrated pipeline that combines protein language models and genomic foundation models for veterinary virus surveillance and countermeasure design.

flowchart TD
    A[Field Sample Collection] --> B[High-Throughput Sequencing]
    B --> C{Sequence Type}
    C -->|Viral Genome| D[Genomic Foundation Model Embedding]
    C -->|Viral Protein Coding| E[ESM-2 Embedding]
    D --> F["Clustering & Variant Classification"]
    E --> G[ESMFold Structure Prediction]
    F --> H["Recombination Detection / Host Range Inference"]
    G --> I["Epitope Mapping / Vaccine Design"]
    I --> J[Subunit or VLP Expression]
    H --> K["Surveillance Report & Risk Assessment"]
    K --> L[Targeted Diagnostic Assay Design]
    J --> L
    L --> M[Field Validation]
    M --> A

Case Study: Feline Coronavirus Spike Evolution

Feline coronavirus (FCoV) exists as two biotypes: feline enteric coronavirus (FECV) and feline infectious peritonitis virus (FIPV). Spike protein mutations that confer macrophage tropism are key determinants of biotype switching. ESMFold models of the FCoV spike RBD from sequential isolates in endemic catteries can be compared to predict which mutations alter receptor binding affinity. When combined with genomic foundation model embeddings of whole-genome sequences, these predictions can help identify recombination hotspots that may accelerate the emergence of pathogenic variants. The recent demonstration that UDCA inhibits FIPV [5] provides a functional check: predicted structural changes in the spike due to common substitutions can be correlated with sensitivity to interferon-mediated antiviral states.
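Comparing sequential isolates starts with listing the substitutions that separate each isolate from a reference. The sketch below assumes the two sequences are already aligned to equal length, numbers positions by alignment column, and skips gap columns. The sequences are short invented strings, not FCoV spike sequences, and `list_substitutions` is an illustrative name.

```python
def list_substitutions(reference, isolate, start=1):
    """List substitutions between two pre-aligned sequences as strings like 'F4L'.

    Positions are alignment columns counted from `start`; columns with a gap
    ('-') in either sequence are skipped rather than reported.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be pre-aligned to equal length")
    substitutions = []
    for offset, (ref_aa, iso_aa) in enumerate(zip(reference, isolate)):
        if ref_aa != iso_aa and "-" not in (ref_aa, iso_aa):
            substitutions.append(f"{ref_aa}{start + offset}{iso_aa}")
    return substitutions


# Hypothetical aligned fragments from an early and a later isolate.
toy_reference = "MKLFVT-SA"
toy_isolate = "MKLLVTGSD"

print(list_substitutions(toy_reference, toy_isolate))  # ['F4L', 'A9D']
```

Each listed substitution can then be mapped onto the predicted RBD model to ask whether it falls in a receptor-facing surface or a low-confidence region.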

Cross-Host and One Health Implications

Foundation models trained across multiple host species, including humans, enable comparative analysis of zoonotic potential. For example, the insulin hijacking mechanism reported in a nematode [8] illustrates how host hormone systems can be subverted by a pathogen; comparable subversion of host signaling may also occur in arteriviruses and coronaviruses. Genomic foundation models that incorporate host protein embeddings could help identify viral proteins that mimic host signaling domains, a capability that would inform the assessment of livestock viruses for human spillover risk.

Challenges and Future Directions

Several limitations constrain the routine deployment of foundation models in veterinary virology. First, confidence estimates such as pLDDT are difficult to interpret for the disordered regions common in viral proteins, where a low score may reflect genuine flexibility or simply a poor prediction. Second, training data are biased toward well-characterized viral families, reducing accuracy for understudied agents such as those in aquatic species (e.g., cyprinid herpesviruses). Third, host-specific post-translational modifications, which are critical for envelope glycoprotein function, are not captured by models that use only sequence inputs.

Future developments should include the incorporation of glycan and lipid context into protein language models, as well as the fine-tuning of genomic foundation models on host-specific metagenomic assemblies from livestock and wildlife reservoirs. The integration of foundation model outputs with point-of-care diagnostics, such as CRISPR-based platforms for avian influenza, could transform field identification of antigenic variants.

Conclusion

Biological foundation models, from ESMFold to genomic language models, are reshaping the computational toolkit of veterinary virology. By enabling structure prediction for poorly characterized viral proteins and providing a unified representation of viral genomes, these models accelerate the discovery of neutralizing epitopes, facilitate real-time surveillance of emerging variants, and support rational vaccine design. Continued validation against experimental data from animal models, such as IFNAR2 knockout pigs [3] and bat organoids [4], will help ensure that the predictions translate into practical diagnostic and prophylactic tools. The incorporation of these methods into national veterinary surveillance programs, coordinated through data-sharing platforms such as GISAID (the Global Initiative on Sharing All Influenza Data), promises to enhance preparedness for both endemic livestock diseases and zoonotic threats.

References

  1. GBD 2023 Diarrhoeal Disease and Enteric Infectious Diseases Collaborators. Global burden of enteric infectious diseases, diarrhoeal diseases, and corresponding aetiologies, 1990-2023: a systematic analysis for the Global Burden of Disease Study 2023. Lancet Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42229499/

  2. Xu L, Chen H, Ji L, et al. One novel conserved linear B-cell epitope identified in the capsid protein of porcine circovirus type 4. Vet Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42214286/

  3. Kim GA, Yeon JH, Lee J, et al. Establishment of an IFNAR2 Knockout Pig Model for Severe Dengue-Like Pathology. J Med Virol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42171447/

  4. Madden SR, Rynda-Apple A, Bimczok D. Leveraging organoid models to understand mechanisms of viral infections and immunity in bats. Dis Model Mech. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42125925/

  5. Zhong Y, Sun Z, Song Z, et al. Ursodeoxycholic acid inhibits feline infectious peritonitis virus infection through activating JAK-STAT signaling pathway-induced type I interferon. Microbiol Spectr. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42112801/

  6. Zhao Y, Ma Z, Zhang Z, et al. Immune responses triggered by oral administration of recombinant Bacillus subtilis expressing the E2 protein of classical swine fever virus. Virology. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42096829/

  7. Li F, Xiao H, Xiao X, et al. Study on the anti-PRRSV effect of recombinant porcine alpha interferon. Vet Immunol Immunopathol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42025227/

  8. Huang X, Du Z, Chen X, et al. Host insulin hijacking by a nematode receptor mediates developmental plasticity and sex ratio shifts. Nat Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42020424/

  9. Yang M, Zhao Y, Guo W, et al. Development of a vaccine based on mRNA assembly of PEDV virus-like particle. J Virol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42012185/

  10. Chen H, Ma S, Yuan L, et al. Chicken Infectious Anemia Virus Markedly Enhances the Pathogenicity of Infectious Bronchitis Virus-Infected Chickens. Transbound Emerg Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42007476/

  11. Chen G, Lv M, Xu Y, et al. An immersion challenge model for type II grass carp reovirus (GCRV-II) induces effective infection and activates antiviral immune response in grass carp. Fish Shellfish Immunol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42001979/

  12. Wang P, Wang S, Xiong N, et al. Non-ligand-binding TLR20.2 and dsRNA-binding TLR20.3 form heterodimer for synergistic antiviral response in grass carp. J Immunol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42001516/

  13. Zou M, Liu S, Chen Y, et al. Spatial transcriptomic atlas of murine neurotoxocariasis reveals region-specific host responses and dysfunction in the brain. Nat Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41997914/

  14. Fan Y, Li X, Mo J, et al. A novel TaqMan-based RT-qPCR assay for the detection of PEDV and discrimination of the G2c subtype. Arch Virol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41995904/

  15. GBD 2023 MASLD Collaborators. Global burden of metabolic dysfunction-associated steatotic liver disease, 1990-2023, and projections to 2050: a systematic analysis for the Global Burden of Disease Study 2023. Lancet Gastroenterol Hepatol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41990758/