Biological Foundation Models for Predicting Host Tropism of Zoonotic Viruses

Introduction

The accurate prediction of host tropism for zoonotic viruses is a critical capability in veterinary virology and One Health surveillance. Host tropism, the range of species a virus can infect, is determined by a complex interplay of molecular interactions, primarily between viral surface proteins and host cell receptors. Traditional methods for determining tropism rely on in vitro binding assays, cell culture susceptibility testing, and experimental infection of animal models. These approaches are resource intensive and time consuming, and they cannot scale to the volume of novel viral sequences generated by modern metagenomic surveillance programs.

Biological foundation models offer a computational solution to this bottleneck. These models are deep learning architectures pretrained on vast repositories of biological sequence data, enabling them to learn universal representations of protein structure and function. When fine tuned for specific tasks such as receptor binding prediction or host range classification, these models can generalize to unseen viral variants and emerging pathogens. This review examines the architectural principles, training paradigms, and biophysical foundations of foundation models applied to predicting host tropism, with a focus on viral glycoproteins and host receptor compatibility.

Biophysical Basis of Host Tropism

Host tropism is fundamentally constrained by the molecular compatibility of viral entry machinery with host cellular factors. For enveloped viruses, the key determinants include the viral attachment protein (often a glycoprotein or hemagglutinin), the host cell surface receptor, and the cellular protease machinery required for fusion activation. For nonenveloped viruses, capsid proteins mediate receptor engagement and cell entry.

The interaction between viral glycoproteins and host receptors is governed by thermodynamic and kinetic parameters. Binding affinity is commonly reported as the dissociation constant (Kd), where a lower Kd indicates tighter binding, and higher affinity for the host receptor generally correlates with more efficient infection. However, affinity alone is insufficient for tropism prediction. The structural complementarity of the binding interface, the presence of specific glycosylation patterns, and the pH dependence of conformational changes all contribute to host range limitations.
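
The link between Kd and binding energetics can be made concrete with the standard relation Delta G = RT ln(Kd / c0), where c0 is the 1 M standard state. The following is a minimal sketch only; the temperature of 310.15 K and the two Kd values are illustrative assumptions, not measurements from any study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar: float, temp_k: float = 310.15) -> float:
    """Standard binding free energy in kcal/mol from Kd in mol/L.

    Uses Delta G = R * T * ln(Kd / 1 M); more negative means tighter binding.
    The default temperature (37 degrees C) is an assumption for illustration.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temp_k * math.log(kd_molar)


# Hypothetical Kd values chosen only to show the trend.
for label, kd in [("millimolar binder", 1e-3), ("nanomolar binder", 1e-9)]:
    print(f"{label}: {binding_free_energy(kd):.2f} kcal/mol")
```

Because the relation is logarithmic, each tenfold decrease in Kd contributes the same increment of binding free energy, which is one reason affinity is often modeled on a log scale in regression heads.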

Sialic acid linkage specificity provides a well characterized example from influenza A virus. Avian influenza viruses preferentially bind alpha-2,3 linked sialic acids, while human adapted strains preferentially bind alpha-2,6 linked sialic acids. This linkage specificity is determined by the topology of the hemagglutinin receptor binding site, specifically the orientation of conserved amino acid residues at positions 226 and 228 (H3 numbering). A single amino acid substitution at position 226 from glutamine to leucine (Q226L) can shift binding preference from avian type to human type receptors.

Architecture of Biological Foundation Models

Biological foundation models are distinguished from conventional machine learning approaches by their scale and pretraining methodology. These models are trained on massive unlabeled sequence databases using self supervised learning objectives. The pretraining phase allows the model to capture latent features of protein sequences, including evolutionary conservation, structural motifs, and coevolutionary couplings, without requiring labeled examples of host tropism.

The core architectural components include:

Transformer encoders. These neural network layers use multihead self attention mechanisms to process variable length sequence inputs. Each amino acid position attends to every other position in the sequence, enabling the model to learn long range dependencies critical for protein function. The self attention mechanism assigns weights to pairwise interactions, effectively learning a sequence specific interaction map without explicit structural input.

Protein language model embeddings. The output of the transformer encoder generates per position embeddings. These embeddings can be aggregated (typically via mean pooling or using a classification token) to produce a fixed length sequence representation. These embeddings serve as feature vectors for downstream classification or regression tasks.

Multimodal fusion layers. For host tropism prediction, sequence information alone may be insufficient. Advanced architectures incorporate auxiliary inputs including structural predictions from AlphaFold or ESMFold, physicochemical properties of amino acid side chains, and host receptor sequence information. Multimodal fusion layers integrate these heterogeneous data types through cross attention or concatenation before passing to the prediction head.

Prediction heads. The final layers of the model are task specific. For host tropism classification, a softmax output layer predicts probability scores across a predefined set of host species categories. For receptor binding prediction, a regression head may predict binding affinity values. Some architectures employ multi task learning, simultaneously predicting multiple related properties to improve generalization.
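
The way these components fit together can be sketched in a few lines. The toy example below, written in plain Python for readability, chains a single self attention head, mean pooling, and a softmax prediction head. It is not a real foundation model: all weights are random and untrained, the embedding size, the three host labels, and the example peptide are assumptions made for illustration, and the multimodal fusion step is omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HOSTS = ["avian", "swine", "human"]  # illustrative label set (assumption)
DIM = 8                              # toy embedding size (assumption)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single head scaled dot product attention over all residue positions."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    # weights[i][j]: how strongly position i attends to position j
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v), weights


def predict_host(sequence, params):
    # Per residue embeddings; raises ValueError on non standard residues.
    x = [params["embed"][AMINO_ACIDS.index(aa)] for aa in sequence]
    hidden, _ = self_attention(x, params["wq"], params["wk"], params["wv"])
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]  # mean pooling
    logits = matmul([pooled], params["head"])[0]               # prediction head
    return dict(zip(HOSTS, softmax(logits)))


rng = random.Random(0)
params = {
    "embed": rand_matrix(len(AMINO_ACIDS), DIM, rng),
    "wq": rand_matrix(DIM, DIM, rng),
    "wk": rand_matrix(DIM, DIM, rng),
    "wv": rand_matrix(DIM, DIM, rng),
    "head": rand_matrix(DIM, len(HOSTS), rng),
}
probs = predict_host("MKTIIALSYIFCLALG", params)  # hypothetical peptide
print(probs)
```

With untrained weights the output probabilities are close to uniform; the point is only the data flow from residues to a fixed length vector to a distribution over host categories.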

Training Strategies for Host Tropism Prediction

Pretraining on Viral Protein Repositories

Foundation models for viral applications are typically pretrained on large collections of viral protein sequences. These datasets may include all publicly available viral protein sequences from databases such as UniProt or GenBank. The pretraining objective is often a masked language modeling task, where a random subset of amino acids in each input sequence is masked, and the model learns to predict the masked residues from context. This process forces the model to learn the statistical regularities of viral protein evolution.

Alternatively, autoregressive modeling predicts the next amino acid in a sequence given all previous residues. This approach captures sequential dependencies but does not provide bidirectional context. Both approaches generate embeddings that encode functional information about the input sequences.
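
The two pretraining objectives described above differ mainly in how training examples are built from a raw sequence. The sketch below shows one plausible way to construct them. The 15% masking rate is an assumption borrowed from common language modeling practice rather than a value stated here, and the mask token string and function names are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_sequence(sequence, mask_rate=0.15, rng=None):
    """Build one masked language modeling example.

    Returns the token list with some residues replaced by MASK, plus a dict
    mapping each masked position to the residue the model must recover.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = rng or random.Random()
    tokens = list(sequence)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


def autoregressive_pairs(sequence):
    """Build (prefix, next residue) pairs for next residue prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


tokens, targets = mask_sequence("MKTIIALSYIFCLALGQDLPG", rng=random.Random(1))
print(tokens)
print(targets)
print(autoregressive_pairs("MKTI"))
```

In the masked example the model sees residues on both sides of each masked position, whereas each autoregressive pair exposes only the residues to the left of the target.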

Fine Tuning with Curated Host Tropism Data

Pretrained models are adapted to host tropism prediction through supervised fine tuning. This requires a curated dataset of viral sequences with experimentally confirmed host tropism labels. The quality and breadth of this dataset directly determine model performance. Key considerations include:

Taxonomic balance. The training set should include viruses spanning multiple families, genera, and species. Models trained predominantly on influenza A virus may not generalize to paramyxoviruses or coronaviruses without appropriate representation.

Host species diversity. Host categories must reflect the taxonomic range of interest. Common categories include avian, swine, equine, canine, feline, and chiropteran hosts. For zoonotic risk assessment, the model must distinguish between viruses restricted to animal hosts and those capable of infecting humans.

Temporal and geographic variation. Viral sequences from different years and geographic regions may exhibit distinct tropism patterns. The training set should capture this variation to avoid temporal bias.

Few Shot and Zero Shot Learning

An emerging capability of foundation models is their ability to perform few shot and zero shot prediction. In the few shot setting, the model receives a small number of labeled examples (e.g., five to ten sequences per host category) and adapts its predictions accordingly. One route to this is in context learning, where labeled examples are included in the input prompt alongside the query sequence.

Zero shot prediction relies entirely on the model's pretrained knowledge. For host tropism, zero shot performance depends on the degree to which tropism determinants are encoded in the sequence itself. If the model has learned to associate specific sequence motifs with particular host species during pretraining, it may predict tropism without explicit fine tuning. However, zero shot accuracy remains substantially lower than fine tuned performance for most viral families.

Evaluation and Validation

Rigorous evaluation of host tropism prediction models requires carefully constructed test sets. A critical consideration is sequence identity between training and test sequences. High sequence similarity (e.g., greater than 90% identity) can inflate performance metrics because the model memorizes rather than generalizes. Clustering sequences at a threshold (typically 70% to 80% identity) before splitting into training, validation, and test sets provides a more realistic assessment of generalization.
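
The cluster based split described above can be sketched as follows. The example assumes that cluster assignments have already been computed by an external sequence clustering tool and are available as a mapping from sequence identifier to cluster identifier; the function name, the 80/10/10 fractions, and the identifiers are illustrative assumptions.

```python
import random
from collections import defaultdict


def cluster_split(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test.

    cluster_of maps sequence id -> cluster id (precomputed elsewhere).
    No cluster is ever divided between two splits, so near identical
    sequences cannot leak from the training set into the test set.
    Split sizes are approximate because clusters are kept intact.
    """
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    clusters = sorted(members)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(clusters)

    names = ("train", "validation", "test")
    splits = {name: [] for name in names}
    total = len(cluster_of)
    idx = 0
    for cluster_id in clusters:
        # Advance to the next split once the current one reached its target size.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= fractions[idx] * total:
            idx += 1
        splits[names[idx]].extend(members[cluster_id])
    return splits


# Hypothetical identifiers: 50 sequences spread over 7 clusters.
cluster_of = {f"seq{i}": f"cluster{i % 7}" for i in range(50)}
splits = cluster_split(cluster_of)
print({name: len(ids) for name, ids in splits.items()})
```

When clusters are few or very uneven in size, a greedy assignment like this one can leave the last split small or empty, so the resulting split sizes should be checked before training.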

Common evaluation metrics include:

Accuracy. The proportion of correctly classified sequences. This metric is sensitive to class imbalance and may be misleading when host categories are unevenly represented.

Precision and recall. Precision measures the proportion of predicted positive cases that are truly positive. Recall measures the proportion of true positive cases that the model correctly identifies. The F1 score is the harmonic mean of precision and recall.

Area under the receiver operating characteristic curve (AUC-ROC). This metric evaluates the model's ability to discriminate between host categories across different probability thresholds. With more than two host categories it is typically computed one versus rest for each category and then averaged. An AUC-ROC of 1.0 indicates perfect discrimination, while 0.5 indicates random performance.

Calibration. Well calibrated models produce probability estimates that reflect true frequencies. For example, if the model predicts a 0.80 probability of avian tropism for a set of sequences, approximately 80% of those sequences should be avian adapted. Calibration curves and expected calibration error quantify this property.
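
Two of the metrics above, per category precision, recall, and F1, and expected calibration error, are simple enough to compute directly. The sketch below uses equal width confidence bins for the calibration error, which is one common convention; the choice of ten bins, the function names, and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one host category treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin size weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy labels, not real data.
y_true = ["avian", "avian", "human", "swine", "human", "avian"]
y_pred = ["avian", "human", "human", "swine", "avian", "avian"]
print(precision_recall_f1(y_true, y_pred, positive="avian"))
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1]))
```

Averaging the per category F1 values (macro averaging) gives each host category equal weight, which is often preferable to overall accuracy when categories are unevenly represented.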

Case Study: Influenza A Virus Host Tropism Prediction

Influenza A virus serves as a model system for host tropism prediction due to its well characterized receptor binding properties and extensive sequence databases. Eng et al. (2014) applied a random forest classifier to predict influenza A virus host tropism from protein features [1]. Their approach extracted features from the hemagglutinin and neuraminidase proteins, including amino acid composition, dipeptide composition, and physicochemical properties. The random forest model achieved high classification accuracy for distinguishing avian, swine, and human influenza isolates.
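
To make the hand engineered features concrete, the sketch below computes amino acid composition and dipeptide composition for a protein sequence. It illustrates the general feature types named above and is not a reconstruction of the exact pipeline used in [1]; the function names and the example peptide are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues (20 features)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 features)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non standard residues
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def composition_features(seq):
    """Concatenated 420 dimensional feature vector for a classical classifier."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


features = composition_features("MKTIIALSYIFCLALG")  # hypothetical peptide
print(len(features))
```

Feature vectors of this kind discard residue order beyond adjacent pairs, which is the limitation that full sequence transformer models are intended to address.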

Foundation model approaches extend this work in several directions. First, transformer based models can process full length hemagglutinin sequences without manual feature engineering, potentially capturing epistatic interactions between distant residues that influence receptor binding. Second, pretrained embeddings may generalize to influenza subtypes and lineages poorly represented in labeled training data. Third, multimodal architectures can integrate structural predictions from homology models or deep learning based structure prediction tools.

For equine influenza, foundation models must account for the specific receptor binding adaptations of H3N8 and H7N7 subtypes. Equine tracheal epithelium predominantly expresses alpha-2,3 linked sialic acids, and equine influenza viruses have evolved binding specificities that favor particular alpha-2,3 linked glycan topologies. Models trained on avian and human influenza may misclassify equine isolates unless equine specific sequences are included in the training set.

Limitations and Challenges

Despite their promise, biological foundation models for host tropism prediction face several limitations.

Data scarcity for rare host species. Most available viral sequences originate from humans, poultry, swine, and cattle. Viruses with tropism for wildlife species, such as chiropteran hosts, are underrepresented in public databases. Models trained predominantly on human and livestock host data may produce unreliable predictions for bat borne zoonoses.

Receptor independent determinants of tropism. Host tropism is not solely determined by receptor binding. Intracellular factors, including antiviral restriction factors, RNA interference pathways, and innate immune signaling components, modulate species specific permissiveness. Foundation models trained exclusively on surface protein sequences cannot capture these post entry determinants.

Temporal evolution of tropism. Viral tropism can shift over time through accumulation of mutations. A model trained on historical sequences may not accurately predict the tropism of emerging variants. Continuous retraining or online learning frameworks are needed to maintain predictive accuracy.

Interpretability. Deep learning models, particularly large transformer architectures, function as black boxes. Identifying the specific sequence features driving a tropism prediction is challenging. Attention weights and saliency maps provide partial interpretability but do not offer mechanistic explanations at the atomic level.

Future Directions

Several methodological advances are poised to improve foundation model performance for host tropism prediction.

Integration with structural biology. The combination of sequence based embeddings with experimentally determined or computationally predicted three dimensional structures will enable models to reason about binding interfaces with atomic resolution. Graph neural networks operating on protein structures can directly model the geometry of receptor binding sites.

Dynamic host receptor databases. Foundation models could benefit from incorporating host receptor sequence variation across species. A model that simultaneously embeds viral glycoprotein sequences and host factor sequences (e.g., ACE2 for coronaviruses, or for influenza the sialyltransferases that determine which sialic acid linkages a tissue displays) could predict cross species compatibility in a pairwise manner.

Uncertainty quantification. Reliable host tropism predictions for veterinary and public health applications require calibrated uncertainty estimates. Bayesian neural networks and ensemble methods can provide prediction intervals that indicate when a sequence falls outside the model's training distribution, flagging cases requiring experimental validation.

Multispecies epidemiology models. Integration of foundation model predictions with ecological and epidemiological models of pathogen transmission could improve risk assessments for spillover events. Predicted host range from molecular data can inform surveillance priorities for wildlife livestock interfaces.
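
The ensemble approach to uncertainty quantification mentioned above can be sketched with a simple disagreement measure. The example averages the class probabilities from several independently trained models and uses the entropy of that average to flag sequences for experimental follow up. The entropy threshold of 0.7 nats and the probability vectors are hypothetical values chosen for illustration, and high entropy is only a proxy for a sequence lying outside the training distribution.

```python
import math


def ensemble_uncertainty(member_probs):
    """Mean class probabilities across ensemble members and their entropy (nats).

    member_probs: one probability vector per ensemble member, all over the
    same ordered set of host categories.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy


def needs_validation(member_probs, threshold):
    """Flag a sequence when the ensemble's predictive entropy exceeds threshold."""
    _, entropy = ensemble_uncertainty(member_probs)
    return entropy > threshold


# Hypothetical outputs from three models over (avian, swine, human).
agree = [[0.90, 0.05, 0.05], [0.88, 0.07, 0.05], [0.92, 0.04, 0.04]]
disagree = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

print(needs_validation(agree, threshold=0.7))
print(needs_validation(disagree, threshold=0.7))
```

In practice the threshold would be tuned on held out data so that flagged sequences correspond to an acceptable rate of experimental follow up.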

Conclusion

Biological foundation models represent a promising approach to predicting host tropism of zoonotic viruses. By learning general representations of viral protein sequences from large scale pretraining, these models can generalize to novel variants and emerging pathogens with limited labeled data. The integration of sequence, structural, and host receptor information within multimodal architectures can capture many of the molecular determinants of cross species transmission, although post entry determinants remain outside the reach of models trained on surface proteins alone. Continued advances in training strategies, evaluation frameworks, and uncertainty quantification will enhance the reliability of these predictions for veterinary diagnostics and One Health surveillance.

The successful deployment of foundation models for host tropism prediction requires close collaboration between computational scientists and veterinary virologists. Experimental validation of computational predictions remains essential, particularly for viruses with sparse training data. As the volume of viral sequence data continues to expand, foundation models will become increasingly central to the early detection and risk assessment of zoonotic threats.

References

[1] Eng CL, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics. 2014. https://pubmed.ncbi.nlm.nih.gov/25521718/