What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

---
title: "Biological Foundation Models for Predicting Host Tropism in Emerging Zoonotic Viruses"
category: "infectious-disease-epi"
metaDescription: "A technical review of biological foundation models, including ESMFold and protein language models, for predicting host tropism in emerging zoonotic viruses, with applications in veterinary virology and One Health surveillance."
primaryKeyword: "biological foundation models host tropism"
secondaryKeywords: ["ESMFold", "protein language models", "zoonotic virus prediction", "viral receptor binding", "One Health", "GISAID", "VirusHostDB"]
---

Biological Foundation Models for Predicting Host Tropism in Emerging Zoonotic Viruses

Introduction

The emergence of zoonotic viruses from animal reservoirs into human populations represents a persistent threat to global health. In veterinary virology, understanding the host tropism of these pathogens is critical for surveillance, risk assessment, and the development of intervention strategies. Host tropism, the range of species a virus can infect, is largely determined by molecular interactions between viral surface proteins and host cell receptors. Traditional experimental methods for characterizing these interactions are labor-intensive and slow, particularly when confronted with novel or rapidly evolving viruses. Biological foundation models, a class of deep learning architectures trained on vast protein sequence and structure data, have emerged as powerful tools for predicting host tropism directly from genomic information. This article provides a technical review of these models, focusing on ESMFold and protein language models, their application to spike-receptor binding prediction, the data sources that underpin them, and their integration into One Health frameworks for emerging zoonotic virus surveillance.

Biological Foundation Models: An Overview

Biological foundation models are large-scale neural networks pre-trained on massive corpora of biological sequences, typically protein or nucleotide sequences. Through self-supervised training, these models learn representations that capture evolutionary, structural, and functional patterns without task-specific labels. The most prominent examples in protein science include the Evolutionary Scale Modeling (ESM) family, which uses transformer architectures similar to those in natural language processing. ESMFold, built on the ESM-2 language model, predicts three-dimensional protein structures from a single sequence, bypassing the multiple sequence alignments (MSAs) that alignment-dependent methods such as AlphaFold2 require.

The key advantage of foundation models for host tropism prediction lies in their ability to generalize across diverse viral families. By learning the statistical properties of millions of protein sequences, these models can encode subtle features of receptor binding domains (RBDs) that correlate with host specificity. For example, the ESM-1b model, with 650 million parameters, has been shown to capture mutational effects on protein function, which enables zero-shot scoring of sequence variants. Such scores can serve as a proxy for how mutations might alter binding to host receptors, although they do not measure binding affinity directly.
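One common way to obtain such zero-shot scores is the masked-marginal log-odds: mask a position, read the model's probability for each residue, and compare the mutant with the wild type. The sketch below shows only the arithmetic. The probability table `TOY_PROBS` is hypothetical and stands in for the output of a masked protein language model; no real model is called.

```python
import math

# Hypothetical per-position residue probabilities, standing in for the
# softmax output of a masked protein language model at each masked site.
# Keys are 0-based positions; values map residue -> probability.
TOY_PROBS = {
    0: {"Q": 0.70, "L": 0.10, "K": 0.20},
    1: {"G": 0.60, "S": 0.30, "A": 0.10},
}

def masked_marginal_score(wild_type, mutations, probs):
    """Sum of log p(mutant) - log p(wild type) over the mutated positions.

    `mutations` is a list of (position, mutant_residue) pairs. A negative
    score means the model finds the variant less plausible than wild type.
    """
    score = 0.0
    for pos, mutant in mutations:
        wt_residue = wild_type[pos]
        score += math.log(probs[pos][mutant]) - math.log(probs[pos][wt_residue])
    return score

wild_type = "QG"  # toy two-residue "protein"
single = masked_marginal_score(wild_type, [(0, "L")], TOY_PROBS)
double = masked_marginal_score(wild_type, [(0, "L"), (1, "S")], TOY_PROBS)
print(f"single mutant: {single:.3f}, double mutant: {double:.3f}")
```

Summing per-site terms assumes the mutations act independently, which is a simplification; the score reflects sequence plausibility under the model, not a measured binding energy.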

Predicting Host Tropism via Spike-Receptor Binding

Host tropism in enveloped viruses is often governed by the interaction between the viral spike glycoprotein and a specific host cell receptor. For coronaviruses, the spike protein's receptor binding domain (RBD) engages with angiotensin-converting enzyme 2 (ACE2) or other receptors. For influenza A viruses, the hemagglutinin (HA) protein binds to sialic acid receptors, with the linkage type (alpha-2,3 vs. alpha-2,6) determining avian versus mammalian tropism. Biological foundation models can predict these interactions by analyzing the structural compatibility between the viral RBD and host receptor orthologs.

A typical workflow involves the following steps:

Sequence Acquisition: Retrieve viral spike or HA sequences from public repositories such as GISAID (for influenza and coronaviruses) or VirusHostDB (a curated database of virus-host interactions).
Structure Prediction: Use ESMFold or similar models to generate high-confidence three-dimensional structures of the viral RBD and the host receptor ectodomain.
Docking Simulation: Employ protein-protein docking algorithms (e.g., ClusPro, HADDOCK) to model the binding interface, or use learned interaction models that directly predict binding affinity from sequence embeddings.
Tropism Scoring: Compute a binding energy or compatibility score across a panel of potential host species. Species with favorable scores are considered permissive.
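The final scoring step can be sketched as follows. The scores in `TOY_SCORES` are invented for illustration, and both the choice of a reference host and the `tolerance` value are assumptions rather than established cutoffs; in practice these would need calibration against experimental data.

```python
# Hypothetical docking or learned-affinity scores (more negative = more
# favorable binding) for one viral RBD against a panel of receptor orthologs.
TOY_SCORES = {
    "human": -9.1,
    "pig": -8.4,
    "cat": -7.9,
    "chicken": -4.2,
    "mouse": -5.0,
}

def rank_hosts(scores, reference_host, tolerance=1.5):
    """Rank hosts by score and flag those within `tolerance` of a reference.

    The reference host is a species assumed to be permissive. A host is
    flagged when its score is no worse than the reference score plus the
    tolerance. Both choices are illustrative.
    """
    cutoff = scores[reference_host] + tolerance
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [(host, score, score <= cutoff) for host, score in ranked]

panel = rank_hosts(TOY_SCORES, reference_host="human")
flagged = [host for host, _, permissive in panel if permissive]
for host, score, permissive in panel:
    print(f"{host:8s} {score:6.1f} {'permissive' if permissive else '-'}")
```

Scoring relative to a known permissive host avoids relying on an absolute energy threshold, which tends not to transfer between docking tools.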

Table 1 summarizes the key components of this pipeline.

Table 1: Components of a Foundation Model-Based Host Tropism Prediction Pipeline

Component	Description	Example Tools/Models
Sequence data	Viral and host receptor sequences	GISAID, VirusHostDB, NCBI RefSeq
Structure prediction	Deep learning-based protein folding	ESMFold, OmegaFold
Embedding generation	Sequence-to-vector representation	ESM-1b, ESM-2, ProtBERT
Interaction prediction	Binding affinity or interface scoring	ClusPro, HADDOCK, geometric deep learning
Tropism classification	Species-level permissiveness	Random forest, logistic regression

Data Sources for Training and Validation

The performance of foundation models depends critically on the quality and diversity of training data. For host tropism prediction, two data sources are particularly relevant:

GISAID (Global Initiative on Sharing All Influenza Data) is the primary repository for influenza virus sequences, including avian, swine, and equine isolates. It also hosts coronavirus sequences, including those from SARS-CoV-2 and related bat and pangolin coronaviruses. GISAID provides metadata on host species, geographic origin, and collection date, enabling the construction of labeled datasets for supervised learning.
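When the host metadata field is available it should be preferred, but host labels can also be recovered from influenza strain names, which conventionally take the form type/host/location/strain number/year, with the host omitted for human isolates. The sketch below assumes names of that shape; the `HOST_GROUPS` mapping is a small illustrative assumption, and any host token missing from it is skipped rather than guessed.

```python
# Coarse host groups keyed by the host token found in strain names.
# This mapping is illustrative and far from exhaustive.
HOST_GROUPS = {
    "chicken": "avian", "duck": "avian", "turkey": "avian", "goose": "avian",
    "swine": "swine", "equine": "equine", "human": "human",
}

def host_from_strain_name(name):
    """Extract the host token from a strain name, or None if unparseable."""
    core = name.split("(")[0].strip()          # drop a trailing "(H5N1)" etc.
    fields = [field.strip() for field in core.split("/")]
    if len(fields) == 5:                       # type/host/location/number/year
        return fields[1].lower()
    if len(fields) == 4:                       # host omitted: human isolate
        return "human"
    return None

def label_records(names):
    """Return (name, host_group) pairs, dropping names that cannot be labeled."""
    labeled = []
    for name in names:
        group = HOST_GROUPS.get(host_from_strain_name(name))
        if group is not None:
            labeled.append((name, group))
    return labeled

names = [
    "A/chicken/Hong Kong/220/1997(H5N1)",
    "A/California/07/2009(H1N1)",
    "A/swine/Iowa/15/1930(H1N1)",
    "A/equine/Miami/1/1963(H3N8)",
    "unparseable header",
]
labeled = label_records(names)
for name, group in labeled:
    print(group, name)
```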

VirusHostDB (Virus-Host DB) is a manually curated database of virus-host relationships, covering a wide range of viral families. Its records of known virus-host pairs can be used to validate species-level predictions, while receptor usage generally has to be confirmed against the experimental literature. For veterinary applications, VirusHostDB contains entries for livestock and companion-animal pathogens such as Bovine Coronavirus, Canine Coronavirus, and Feline Coronavirus.

Additional resources include the NCBI Virus database and the UniProt Knowledgebase, which provide sequence and functional annotation for viral proteins and host receptors.

Case Study: Influenza A Virus Host Tropism

Influenza A viruses (IAV) circulate in wild waterfowl and can spill over into domestic poultry, swine, horses, and occasionally humans. The hemagglutinin (HA) protein's receptor binding specificity is a major determinant of host range. Avian IAVs preferentially bind alpha-2,3-linked sialic acids, while human-adapted IAVs bind alpha-2,6-linked sialic acids. Swine tracheal epithelium expresses both linkages, making pigs a potential mixing vessel.

Eng et al. [1] developed a random forest classifier to predict host tropism of IAV proteins using sequence-derived features. Their model achieved high accuracy in distinguishing avian from human isolates based on HA and neuraminidase (NA) sequences. While this approach predates foundation models, it illustrates the feasibility of machine learning for tropism classification. Modern foundation models can extend this work by incorporating structural information and generalizing to novel subtypes.

A foundation model-based approach for IAV would involve:

Extracting HA sequences from GISAID for avian, swine, and equine isolates.
Using ESMFold to predict HA structures and identify the receptor binding site.
Computing ESM embeddings for each sequence and training a classifier on host labels.
Validating predictions against known receptor binding phenotypes from glycan microarray experiments.
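The embedding-and-classifier step can be sketched with the standard library alone. Two substitutions are made here and should be read as assumptions: amino-acid composition stands in for an ESM embedding, and a nearest-centroid rule stands in for the random forest or neural network a real pipeline would use. The sequences and host labels are invented toy data.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(sequence):
    """Amino-acid composition vector (a placeholder for an ESM embedding)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [count / total for count in counts]

def fit_centroids(vectors, labels):
    """Average the feature vectors of each host class."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, vector):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=distance)

# Toy fragments with invented host labels, for illustration only.
train = [
    ("QQGGQG", "avian"), ("QGGQQA", "avian"),
    ("LLSSLS", "human"), ("LSSLLA", "human"),
]
centroids = fit_centroids(
    [featurize(seq) for seq, _ in train],
    [label for _, label in train],
)
pred_avian_like = predict(centroids, featurize("QGQGQL"))
pred_human_like = predict(centroids, featurize("LSLSLQ"))
print(pred_avian_like, pred_human_like)
```

Swapping `featurize` for a function that returns mean-pooled language-model embeddings would leave the rest of the sketch unchanged, which is the main reason to keep featurization and classification separate.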

The Mermaid diagram below illustrates the workflow.

flowchart TD
    A[GISAID Sequence Data] --> B[ESMFold Structure Prediction]
    B --> C[Receptor Binding Site Identification]
    C --> D[ESM Embedding Extraction]
    D --> E["Classifier Training (Random Forest / Neural Network)"]
    E --> F[Host Tropism Prediction]
    F --> G[Validation with VirusHostDB / Glycan Arrays]
    G --> H[One Health Risk Assessment]

One Health Applications

The One Health framework recognizes the interconnectedness of human, animal, and environmental health. Biological foundation models for host tropism prediction have direct applications in veterinary surveillance and outbreak preparedness.

Wildlife Surveillance: Bats, rodents, and birds are major reservoirs of emerging viruses. By sequencing viral samples from wildlife and applying foundation models, veterinary authorities can identify viruses with high spillover potential to livestock or companion animals. For example, predicting whether the spike protein of a novel bat coronavirus is compatible with feline receptor orthologs could inform risk assessments for domestic cats.

Livestock Protection: In poultry, Highly Pathogenic Avian Influenza (HPAI) H5N1 continues to cause outbreaks. Foundation models can rapidly assess whether newly emerged H5N1 strains have acquired mutations that enhance binding to mammalian receptors, signaling an increased risk to swine or humans. Similarly, for Porcine Reproductive and Respiratory Syndrome Virus (PRRSV), predicting tropism shifts could guide vaccine strain selection.

Companion Animal Medicine: Viruses such as Canine Distemper Virus and Feline Calicivirus can cross species barriers. Foundation models can help identify which wildlife reservoirs pose a threat to domestic dogs and cats, enabling targeted vaccination campaigns.

Limitations and Future Directions

Despite their promise, biological foundation models have several limitations. First, they require large amounts of high-quality sequence data, which may be unavailable for poorly studied viral families. Second, the models are trained on known protein sequences and may not generalize to entirely novel folds or unnatural amino acids. Third, host tropism is influenced by factors beyond receptor binding, including intracellular replication factors, immune evasion, and transmission dynamics. Integrating these factors into predictive models remains a challenge.

Future developments may include:

Multimodal models that combine sequence, structure, and metadata (e.g., host phylogeny, geographic distribution).
Active learning strategies to prioritize experimental validation of high-risk predictions.
Integration with real-time genomic surveillance platforms to provide near-instantaneous tropism assessments during outbreaks.
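As one illustration of the active learning idea, uncertainty sampling sends the predictions the model is least sure about to the laboratory first. The sketch below restricts candidates to those above a minimum predicted risk and then ranks them by binary entropy. The virus names, probabilities, and the `min_risk` threshold are all hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_validation(predictions, k, min_risk=0.3):
    """Pick the k most uncertain predictions among those at or above min_risk."""
    candidates = [(name, p) for name, p in predictions.items() if p >= min_risk]
    candidates.sort(key=lambda item: binary_entropy(item[1]), reverse=True)
    return [name for name, _ in candidates[:k]]

# Hypothetical predicted spillover probabilities for five candidate viruses.
TOY_PREDICTIONS = {
    "virus_A": 0.95,
    "virus_B": 0.55,
    "virus_C": 0.10,
    "virus_D": 0.40,
    "virus_E": 0.70,
}
chosen = select_for_validation(TOY_PREDICTIONS, k=2)
print(chosen)
```

Here the confidently high-risk `virus_A` is not selected, because another experiment would teach the model little; a surveillance program might still test it for reasons unrelated to model improvement.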

Conclusion

Biological foundation models offer a faster, sequence-based route to predicting host tropism for emerging zoonotic viruses. By leveraging deep learning on protein sequences and structures, these models can rapidly assess the spillover potential of novel viruses from animal reservoirs. When combined with curated databases such as GISAID and VirusHostDB, and embedded within a One Health surveillance framework, they offer a powerful tool for veterinary virology and public health preparedness. Continued refinement of these models, along with experimental validation, will be essential for translating computational predictions into actionable risk assessments.

References

[1] Eng CL, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics. 2014;7 Suppl 3:S6. doi:10.1186/1755-8794-7-S3-S6. URL: https://pubmed.ncbi.nlm.nih.gov/25521718/