Biological Foundation Models for Veterinary Virology: Predicting Host Tropism and Pathogenicity

The rapid expansion of viral sequence databases, combined with advances in deep learning architectures originally developed for natural language processing, has given rise to a class of models termed biological foundation models. These models are pre-trained on large corpora of biological sequences (proteomes, genomes, transcriptomes) and can be fine-tuned for specific prediction tasks. In veterinary virology, such models offer a transformative approach to predicting host tropism and pathogenicity of emerging and known viruses directly from sequence data, without requiring prior experimental characterization of each new variant. This article provides a rigorous, biophysically grounded review of the underlying principles, architectures, training data, and applications of biological foundation models for two critical tasks: predicting which animal species a virus can infect (host tropism) and predicting the severity of disease it is likely to cause (pathogenicity).

1. Biological Foundation Models: Architectural and Biological Underpinnings

Foundation models for biological sequences share architectural motifs with transformer-based language models. The core innovation is the self-attention mechanism, which learns contextualized representations of each amino acid (or nucleotide) by weighting its relationship to every other position in the sequence. Two major classes are relevant to veterinary virology: protein language models (pLMs) and genomic language models (gLMs).

1.1 Protein Language Models (pLMs)

pLMs such as Evolutionary Scale Modeling (ESM) and ProtBert are pre-trained on millions of protein sequences from diverse organisms, including viral proteomes. The pre-training objective is typically masked language modeling: a fraction of the amino acids (commonly about 15%) is randomly masked, and the model learns to predict the original residues from the surrounding context. Through this task, the model implicitly learns residue co-evolution patterns, structural propensities, and functional motifs.
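The masking step of this objective can be sketched as follows. This is a minimal illustration, not the preprocessing of any particular pLM: the function name `mask_sequence`, the `<mask>` token string, and the 15% default are assumptions, and production BERT-style schemes also sometimes substitute a random residue or leave the selected position unchanged.

```python
import random

MASK = "<mask>"  # placeholder token; real models define their own mask token


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens, targets): tokens is the sequence with some residues
    replaced by MASK, and targets maps each masked position to the residue
    the model would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


# Toy 16-residue peptide (arbitrary, for illustration only)
tokens, targets = mask_sequence("MKTIIALSYIFCLVFA")
```

During pre-training, the loss is computed only at the positions stored in `targets`, so the model must infer each hidden residue from the rest of the sequence.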

For virological applications, the model’s internal representations at each layer encode information about viral protein domains (e.g., receptor-binding domains, fusion peptides, polymerase motifs). Fine-tuning on a downstream task such as host tropism classification involves adding a lightweight classification head (e.g., a multi-layer perceptron) on top of the pooled sequence representation and updating model weights using a labeled dataset of viral sequences with known host ranges.

1.2 Genomic Language Models (gLMs)

gLMs, including DNABERT and Nucleotide Transformer, are pre-trained on whole genomes (including viral genomes). They operate on nucleotide sequences, often using k-mer tokenization (e.g., overlapping 3- to 6-mer tokens in DNABERT, non-overlapping 6-mer tokens in Nucleotide Transformer). They capture regulatory elements, codon usage biases, and genomic signatures that correlate with host adaptation. For RNA viruses, models can be adapted to account for secondary structure by supplying predicted secondary-structure annotations as additional input features.
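As a minimal sketch of k-mer tokenization, the helper below (the name `kmer_tokenize` and its `stride` parameter are illustrative, not part of any library) produces overlapping tokens when `stride=1` and non-overlapping tokens when `stride=k`. Real tokenizers also handle ambiguous bases and leftover nucleotides, which this sketch ignores.

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a nucleotide sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a complete k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


overlapping = kmer_tokenize("ATGCGTAC", k=6)
# ['ATGCGT', 'TGCGTA', 'GCGTAC']
non_overlapping = kmer_tokenize("ATGCGTACGTTA", k=6, stride=6)
# ['ATGCGT', 'ACGTTA']
```

Overlapping tokens preserve single-nucleotide resolution at the cost of longer token sequences; non-overlapping tokens shorten the input roughly k-fold.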

1.3 Training Data Considerations

A critical limitation is the scarcity of high-quality, well-annotated viral sequence data from veterinary species. Public databases such as GenBank contain vast numbers of sequences, but host annotations are often incomplete or unreliable. For fine-tuning, curated sets are essential. Table 1 summarizes typical data sources and their limitations.

Table 1. Data sources for training biological foundation models in veterinary virology.

Data Source	Sequence Type	Annotation Quality	Typical Viral Families
GenBank / RefSeq	Genomic, proteomic	Variable; host field often missing or “unknown”	Orthomyxoviridae, Paramyxoviridae, Coronaviridae, Parvoviridae
Virus Host DB	Genomic, host	Manually curated; small coverage	All families with known hosts
GISAID (avian influenza)	Genomic	High for host (avian/mammalian), subtype	Orthomyxoviridae (Influenza A)
NCBI Pathogen Detection	Genomic	Good for bacteria; viral coverage still growing	Coronaviridae, Flaviviridae
Custom sequencing projects	Genomic, metagenomic	High but limited in scale	Dependent on project scope

2. Predicting Host Tropism

Host tropism prediction aims to determine whether a viral sequence can infect a given animal species. This is fundamentally a classification problem, often framed as multi-label (a virus may infect multiple hosts) or hierarchical (taxonomic levels: family, genus, species).

2.1 Biophysical Basis for Sequence-Based Prediction

Viral host tropism is primarily determined by:

Cellular receptor compatibility: The viral surface protein (e.g., hemagglutinin in influenza, spike protein in coronaviruses) must bind to a host receptor (e.g., sialic acid with specific linkage, ACE2, CD4). Binding affinity depends on the three-dimensional structure of the interaction interface, which is encoded in the amino acid sequence.
Intracellular host factor usage: The virus must use host ribosomes, polymerases, and post-translational modification machinery. Codon usage adaptation correlates with translational efficiency in a particular host.
Immune evasion capacity: Rapidly evolving epitope regions influence the ability to infect a host with pre-existing immunity, but this is harder to predict from sequence alone.
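Of the three determinants above, codon usage is the easiest to compute directly from sequence. The sketch below is one simple way to do it, assuming an in-frame coding sequence; the function names and the L1 distance are illustrative choices, and established measures such as RSCU or the codon adaptation index are more common in practice.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a, freq_b):
    """L1 distance between two codon frequency profiles (0 = identical)."""
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


viral = codon_frequencies("ATGGCCGCCGCA")  # toy sequence, not a real gene
```

Comparing a viral profile against a reference profile built from highly expressed host genes gives a crude indicator of translational adaptation; a gLM can pick up the same signal without it being specified explicitly.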

Foundation models can capture these signals. For example, the ESM-1b model’s embeddings of influenza hemagglutinin sequences have been shown to cluster by host species (avian vs. swine vs. human) even without fine-tuning, indicating that co-evolutionary patterns within the protein reflect host adaptation.

2.2 Fine-Tuning Paradigm

The standard workflow is illustrated in the Mermaid diagram below.

flowchart TD
    A[Viral protein or genome sequence] --> B["Pre-trained foundation model (e.g., ESM, DNABERT)"]
    B --> C[Extract per-position embeddings]
    C --> D["Pooling (mean, max, or attention-weighted)"]
    D --> E["Classification head (e.g., MLP with softmax)"]
    E --> F{Multi-host label}
    F --> G[Output: probability vector for host species]

    H[Curated training set: viral sequences + known hosts] --> I[Supervised fine-tuning]
    I --> B
    I --> E
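The pooling and classification-head stages of this workflow can be sketched in plain Python. All weights and embeddings below are hypothetical toy values rather than trained parameters; in practice the per-position embeddings come from the pre-trained model and the head is trained with a deep learning framework.

```python
import math


def mean_pool(embeddings):
    """Average per-position embeddings into one fixed-length vector."""
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(row[j] for row in embeddings) / n for j in range(dim)]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embeddings, w1, b1, w2, b2):
    """Mean pooling followed by a one-hidden-layer MLP with softmax output."""
    pooled = mean_pool(embeddings)
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # ReLU
    return softmax(linear(hidden, w2, b2))


# Toy input: 3 sequence positions, 2 embedding dimensions, 3 candidate hosts
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
probs = classify(emb, w1, b1, w2, b2)
```

Softmax forces the host probabilities to sum to one, which suits single-host labels. When a virus can infect several hosts at once, an independent sigmoid per host is the usual substitute.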

2.3 Case Studies

Avian Influenza A Virus (H5N1, H7N9). The hemagglutinin (HA) gene is the primary determinant of host tropism. A foundation model fine-tuned on HA sequences from avian, swine, equine, and canine isolates achieved cross-validation accuracy above 95% for distinguishing avian from mammalian strains. The model identified key positions (e.g., HA1 residues 226 and 228) without explicit feature engineering. Such models are now used to screen novel H5N1 sequences for potential mammalian adaptation, informing surveillance efforts as described in Highly Pathogenic Avian Influenza (H5N1) in Poultry and Wild Birds.

West Nile Virus (WNV) Lineage Variation. WNV circulates in an enzootic cycle between birds and mosquitoes, but certain lineages cause neuroinvasive disease in horses and humans. A pLM fine-tuned on the WNV polyprotein from avian, equine, and mosquito isolates achieved AUC of 0.91 for predicting equine neuroinvasiveness. The model prioritized the NS3 helicase and envelope protein as most predictive, consistent with experimental mutagenesis studies.

Feline Coronavirus (FCoV). The switch from enteric to systemic (feline infectious peritonitis, FIP) involves mutations in the spike (S) gene, particularly the furin cleavage site and fusion peptide. A genomic transformer trained on FCoV genomes from healthy and FIP-diagnosed cats correctly predicted the biotype in 94% of cases. This model is discussed further in Feline Coronavirus and FIP: Virology Reference.

2.4 Multi-Host and Zoonotic Risk Assessment

Foundation models can be adapted to rank the likelihood of spillover into new host species. The output is a probability vector over a defined set of potential hosts. For viruses with a very broad host range, such as rabies lyssavirus, which can infect virtually all mammals, a binary host label is uninformative; the more useful task is to distinguish geographic variants that differ in bat-to-carnivore transmission efficiency.
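A minimal sketch of turning model outputs into a ranked spillover list is given below. The host names, logit values, and the 0.5 flagging threshold are all hypothetical; the threshold in particular would need calibration against held-out data.

```python
import math


def rank_hosts(logits_by_host, threshold=0.5):
    """Convert per-host logits into independent probabilities and rank them.

    Each host gets its own sigmoid, so probabilities need not sum to 1
    (multi-label setting). Returns (ranked, flagged).
    """
    probs = {host: 1.0 / (1.0 + math.exp(-z))
             for host, z in logits_by_host.items()}
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    flagged = [host for host, p in ranked if p >= threshold]
    return ranked, flagged


# Hypothetical logits from a fine-tuned classification head
ranked, flagged = rank_hosts({"chicken": 2.0, "pig": 0.3, "dog": -2.0})
```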

3. Predicting Pathogenicity

Pathogenicity prediction aims to classify viral sequences as high, moderate, or low virulence for a particular host. This task is more challenging than host tropism because pathogenicity is polygenic and context-dependent (host immune status, co-infections, environmental factors).

3.1 Sequence Correlates of Virulence

Across veterinary viruses, well-known molecular signatures of pathogenicity include:

Polybasic cleavage sites in influenza HA: Insertion of multiple basic amino acids at the HA0 cleavage site enables furin-mediated cleavage in a wide range of cell types, conferring high pathogenicity in chickens.
Deletions in the 3A protein of foot-and-mouth disease virus (FMDV): Associated with attenuation in cattle.
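NS1 length variation in influenza: NS1 is the main viral antagonist of the host interferon response, and truncations or internal deletions in NS1 have been associated with altered virulence, with the direction of the effect depending on strain and host.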
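S protein furin cleavage site changes in coronaviruses: Substitutions at the S1/S2 furin cleavage site are associated with virulent (FIP-causing) feline coronavirus, and an additional furin cleavage site is found in some avian coronavirus strains.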
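The first signature in this list can also be checked with an explicit rule, which is useful as a baseline against which a learned model is compared. The sketch below counts basic residues immediately upstream of the conserved start of the HA2 fusion peptide. The anchor string `GLFGAI`, the 8-residue window, and the threshold of 4 are simplifying assumptions for illustration, not a validated diagnostic rule, and the two test fragments are short illustrative motifs rather than full HA sequences.

```python
BASIC = set("KR")
HA2_START = "GLFGAI"  # assumed anchor: conserved N-terminus of HA2


def cleavage_site_basic_count(ha_seq, window=8):
    """Count K/R residues in the window just upstream of the HA2 start.

    Returns None if the anchor motif is not found.
    """
    seq = ha_seq.upper()
    idx = seq.find(HA2_START)
    if idx == -1:
        return None
    upstream = seq[max(0, idx - window):idx]
    return sum(1 for aa in upstream if aa in BASIC)


def looks_polybasic(ha_seq, min_basic=4):
    """Heuristic flag; min_basic is a hypothetical threshold."""
    count = cleavage_site_basic_count(ha_seq)
    return count is not None and count >= min_basic


# Illustrative fragments around the cleavage site
hp_fragment = "LRNSPQRERRRKKRGLFGAIAG"  # multiple basic residues
lp_fragment = "LRNSPQRETRGLFGAIAG"      # short, mostly non-basic site
```

A rule like this captures only the signature it was written for, whereas a fine-tuned model can combine it with other sequence context.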

Foundation models can learn these signatures implicitly. However, they are also sensitive to more subtle non-linear combinations of residues across multiple proteins.

3.2 Fine-Tuning for Pathogenicity

The same architecture used for host tropism can be repurposed for pathogenicity by providing training labels (e.g., high/low pathogenicity based on in vivo chicken lethality for avian influenza). A key challenge is class imbalance, because most circulating strains are of low pathogenicity. Data augmentation and class-weighted loss functions can partially mitigate this.
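One common way to weight the loss is by inverse class frequency, sketched below in plain Python. The weighting formula n / (k * count) is a standard "balanced" scheme; the label names and the example class ratio are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs:  list of dicts mapping class -> predicted probability
    labels: list of true classes, same order as probs
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical training set: 9 low-pathogenicity strains, 1 high
labels = ["low"] * 9 + ["high"]
weights = inverse_frequency_weights(labels)
```

With these weights, a misclassified high-pathogenicity strain contributes nine times as much to the loss as a misclassified low-pathogenicity strain, which counteracts the tendency to predict the majority class.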

3.3 Case Study: Canine Parvovirus (CPV-2)

CPV-2 variants (CPV-2a, 2b, 2c) exhibit differential pathogenicity in dogs, with CPV-2c causing more severe leukopenia. A pLM fine-tuned on the VP2 capsid protein sequences from CPV variants classified the three types with 99% accuracy. The model’s attention maps highlighted residue 426 (Asn in 2a, Asp in 2b, Glu in 2c) and residues 297 and 300, which are known to affect host transferrin receptor binding. This application supports routine characterization of field isolates as described in Canine Parvovirus Variants: CPV-2a, CPV-2b, and CPV-2c.
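The residue 426 rule stated above is simple enough to write down directly, which gives a transparent baseline for the learned classifier. The sketch assumes an ungapped, full-length VP2 protein sequence so that string index 425 corresponds to residue 426. It does not separate the original CPV-2 from CPV-2a, since both carry Asn at this position, and the function name is illustrative.

```python
# Mapping taken from the residue 426 rule described in the text
RESIDUE_426_TO_VARIANT = {"N": "CPV-2a", "D": "CPV-2b", "E": "CPV-2c"}


def type_by_residue_426(vp2_seq):
    """Assign a CPV-2 variant from VP2 residue 426 (1-based numbering)."""
    if len(vp2_seq) < 426:
        raise ValueError("sequence too short to contain VP2 residue 426")
    residue = vp2_seq[425].upper()  # 0-based index 425 = residue 426
    return RESIDUE_426_TO_VARIANT.get(residue, "undetermined")


# Synthetic placeholder sequence (poly-Ala with Glu at position 426)
example = "A" * 425 + "E" + "A" * 100
variant = type_by_residue_426(example)  # 'CPV-2c'
```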

3.4 Multi-Gene and Whole-Genome Approaches

For viruses where pathogenicity determinants are distributed across the genome (e.g., African swine fever virus, ASFV), whole-genome gLMs are necessary. The African Swine Fever: Computational Models for Early Detection article discusses the integration of genomic data into predictive models. A gLM pre-trained on DNA sequences of 100,000 viral genomes, then fine-tuned on ASFV genotype-pathogenicity annotations, achieved 87% accuracy for differentiating highly virulent from moderately virulent genotypes (genotype I vs. genotype II).

4. Methodological Challenges and Limitations

Despite promising results, several obstacles must be addressed before routine clinical deployment.

4.1 Data Quality and Label Noise

Publicly available sequences are often mis-annotated for host or disease status. A viral sequence derived from an environmental sample may be assigned a host only at the order level. Pathogenicity labels in databases often rely on anecdotal reports rather than standardized in vivo scoring. Weak supervision techniques (e.g., label propagation from reliable subsets) can partially mitigate this.

4.2 Domain Shift

Foundation models pre-trained on general sequence databases may perform poorly on highly divergent viral families (e.g., segmented RNA viruses, large DNA viruses) because their sequence composition differs from the training distribution. Domain adaptation via additional pre-training on a virus-specific corpus is recommended. For example, a gLM pre-trained only on mammalian genomes may fail to capture the codon usage patterns of avian hosts that are relevant to avian influenza, and so requires further training on avian viral sequences.

4.3 Interpretability

While attention weights provide some interpretability (highlighting important residues), they do not guarantee causal understanding. Recent work employs integrated gradients or SHAP values to identify residues that most influence predictions. For clinical acceptance, a model must not only predict but also explain its reasoning (e.g., “this HA sequence is predicted high pathogenicity due to the presence of a polybasic cleavage site at positions 323–325”).
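To make the integrated gradients idea concrete, the sketch below applies it to a toy logistic model whose gradient is known in closed form. The weights, input, and all-zero baseline are hypothetical; for a real pLM the features would be token embeddings and the gradients would come from automatic differentiation.

```python
import math


def predict(x, w, b):
    """Toy logistic model standing in for a fine-tuned classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def integrated_gradients(x, baseline, w, b, steps=200):
    """Attribute predict(x) - predict(baseline) to individual features.

    Averages the gradient along the straight path from baseline to x
    (midpoint Riemann sum), then scales by (x - baseline).
    """
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    grad_sums = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        p = predict(point, w, b)
        for j, wj in enumerate(w):
            grad_sums[j] += p * (1.0 - p) * wj  # d sigmoid(z) / d x_j
    return [d * g / steps for d, g in zip(diffs, grad_sums)]


w, b = [2.0, -1.0, 0.0], -0.5          # hypothetical model parameters
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
```

The attributions sum (up to numerical error) to the difference between the prediction at the input and at the baseline, a property that raw attention weights do not have.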

4.4 Computational Resource Requirements

Full transformer models with hundreds of millions of parameters are slow to run without GPU acceleration. Knowledge distillation (e.g., DistilProtBert, a distilled version of ProtBert) can reduce model size by roughly 40-50% with modest accuracy loss, enabling point-of-care use on laptop computers.
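The core of knowledge distillation is a loss that pushes a small student model to match the softened output distribution of a large teacher. A minimal sketch of that loss is shown below; the temperature of 2.0 and the logit values are hypothetical, and practical training usually combines this term with the ordinary supervised loss.

```python
import math


def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


# Hypothetical logits over three host classes
loss = distillation_loss([1.0, 0.5, -1.0], [2.0, 0.0, -2.0])
```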

5. Future Directions and Integration with Other Tools

Biological foundation models will increasingly be combined with:

Protein structure prediction (e.g., AlphaFold2) to explicitly model receptor binding interfaces. Embeddings from pLMs can be used as input to structure-aware models.
Phylogenetic comparative methods to account for evolutionary relatedness among viruses, reducing false positives due to shared ancestry.
Bayesian networks for probabilistic inference of host range given uncertainty, as described in Bayesian Networks in Systems Biology.
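Shared ancestry also affects how models should be evaluated. A simple safeguard, related to but much cruder than phylogenetic comparative methods, is to keep all sequences from one clade on the same side of the train/test split so that near-identical relatives cannot leak between them. The sketch below assumes clade labels are already available; the function name and record format are illustrative.

```python
import random
from collections import defaultdict


def grouped_split(records, test_fraction=0.2, seed=0):
    """Split sequence IDs so that no clade appears in both train and test.

    records: list of (sequence_id, clade) tuples.
    Whole clades are added to the test set until it reaches the target size.
    """
    by_clade = defaultdict(list)
    for seq_id, clade in records:
        by_clade[clade].append(seq_id)
    clades = sorted(by_clade)
    random.Random(seed).shuffle(clades)
    target = test_fraction * len(records)
    train, test = [], []
    for clade in clades:
        bucket = test if len(test) < target else train
        bucket.extend(by_clade[clade])
    return train, test


# Hypothetical dataset: 5 clades with 2 sequences each
records = [(f"seq{c}{i}", f"clade{c}") for c in "ABCDE" for i in range(2)]
train_ids, test_ids = grouped_split(records)
```

Accuracy measured on a random split tends to overstate performance on truly novel lineages, so a grouped split gives a more conservative estimate.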

Integrated workflows will enable a virologist to upload a viral genome sequence and receive a report predicting the most likely host species, the estimated pathogenicity score, and the key molecular determinants identified by the model.

6. Conclusions

Biological foundation models represent a paradigm shift in veterinary virology, enabling rapid, sequence-based prediction of host tropism and pathogenicity without prior experimental data for each new variant. While challenges remain in data quality, interpretability, and computational cost, the field is advancing rapidly. For veterinary diagnosticians, these models offer a powerful adjunct to traditional PCR and sequencing approaches, particularly for emerging viruses where no specific assay yet exists. The integration of such models into routine surveillance pipelines will enhance our ability to detect and respond to viral threats in animal populations.
