What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

---
title: "Biological Foundation Models for Predicting Host Tropism of Zoonotic Viruses"
category: "infectious-disease-epi"
metaDescription: "A detailed review of ESM-2 and ProtGPT2 biological foundation models for predicting cross-species transmission and host tropism of paramyxoviruses and coronaviruses in veterinary virology."
primaryKeyword: "biological foundation models host tropism prediction"
secondaryKeywords: ["ESM-2", "ProtGPT2", "zoonotic virus prediction", "paramyxovirus host range", "coronavirus cross-species transmission", "protein language model virology"]
---

Biological Foundation Models for Predicting Host Tropism of Zoonotic Viruses: ESM-2 and ProtGPT2 Applied to Paramyxoviruses and Coronaviruses

Introduction

Predicting the host tropism of zoonotic viruses is a central objective in veterinary virology and computational epidemiology. The capacity of a virus to infect a new host species depends on molecular compatibility between viral surface proteins and host cell receptors, as well as intracellular host factors that govern replication permissiveness. Traditional approaches to host range prediction have relied on phylogenetic analyses, experimental receptor binding assays, and feature-based machine learning models trained on sequence motifs or physicochemical properties [1]. However, these methods often fail to generalize across diverse viral families due to limited training data and the inability to represent complex protein sequence-structure-function relationships.

Biological foundation models offer a different approach. These models, including the Evolutionary Scale Modeling (ESM-2) architecture and ProtGPT2, are large-scale protein language models trained on tens of millions of protein sequences. They learn distributed representations of amino acid sequences that capture evolutionary, structural, and functional information without explicit supervision. By fine-tuning these models on curated host-pathogen interaction datasets, or by training classifiers on their embeddings, researchers can predict cross-species transmission potential more accurately than feature-based methods allow. This article examines how biological foundation models are applied to predict host tropism, with emphasis on paramyxoviruses and coronaviruses in vertebrate animal hosts.

Biological Foundation Models: Architecture and Training

ESM-2: Evolutionary Scale Modeling with Transformer Architectures

ESM-2 is a transformer-based protein language model trained on sequences from the UniRef database. The model employs a masked language modeling objective: random amino acid positions within a sequence are masked, and the model learns to predict the native residue based on surrounding context. This approach forces the model to internalize evolutionary couplings, residue co-variation patterns, and structural constraints.
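To make the masking objective concrete, the following minimal sketch corrupts a sequence the way a masked language model's data preparation step might. The 15% masking rate, the `<mask>` placeholder string, the toy sequence, and the function name `mask_sequence` are assumptions chosen for illustration; they are not taken from this article, and a real pipeline works on token IDs rather than characters.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder string; an assumption for this sketch


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for one masked-language-model example.

    `targets` maps each masked position to the native residue that the
    model would be trained to recover from the surrounding context.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # remember the native residue
        tokens[pos] = MASK_TOKEN     # hide it from the model
    return tokens, targets


# Arbitrary toy sequence, not a curated viral protein.
toy_seq = "MGSRSSTRIPVPLMLTVRVMLALSCVCPTSS"
tokens, targets = mask_sequence(toy_seq)
```

The training loss is then computed only at the positions listed in `targets`, which is why the model has to rely on the unmasked context to recover each hidden residue.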

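ESM-2 retains a standard transformer encoder. Its main advances over earlier ESM models come from scale (variants range from 8 million to 15 billion parameters), improved training, and rotary position embeddings, not from any explicit encoding of spatial relationships between residues. Structural information emerges implicitly: attention patterns correlate with residue-residue contacts, which is what allows ESMFold to predict structure from ESM-2 representations. The model was trained on sequence crops of up to 1,024 residues. This covers the paramyxovirus fusion (F) protein (approximately 550 residues) in full, whereas the coronavirus spike (S) protein (approximately 1,200 to 1,400 residues) must be truncated, windowed, or restricted to a region such as the receptor-binding domain. Embeddings extracted from the final hidden layers of ESM-2 serve as high-dimensional feature vectors (2,560 dimensions for the 3-billion-parameter variant and 5,120 for the 15-billion-parameter variant) that represent each residue in its evolutionary and structural context.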

ProtGPT2: Generative Language Modeling for Viral Sequence Space

ProtGPT2 is a generative protein language model based on the GPT-2 architecture. Unlike ESM-2, which is trained to reconstruct masked residues and is used mainly for representation learning, ProtGPT2 is trained to generate protein sequences by predicting the next token given all preceding tokens. Its byte-pair encoding tokenizer groups residues into tokens that span about four amino acids on average. This autoregressive framework allows the model to learn the underlying grammar of protein sequence space.
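The autoregressive factorization can be illustrated with a deliberately tiny stand-in. The sketch below fits a first-order (bigram) model, which conditions only on the previous residue, and scores a sequence as the sum of conditional log-probabilities. This is a toy substitute for a transformer decoder, works per residue for simplicity, and uses hypothetical names (`fit_bigram`, `log_likelihood`) and made-up training strings.

```python
import math
from collections import defaultdict

START = "^"  # hypothetical start-of-sequence symbol


def fit_bigram(sequences, alphabet, pseudocount=1.0):
    """Fit p(next | previous) with additive smoothing; return a lookup function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for aa in seq:
            counts[prev][aa] += 1.0
            prev = aa

    def prob(prev, aa):
        total = sum(counts[prev].values()) + pseudocount * len(alphabet)
        return (counts[prev][aa] + pseudocount) / total

    return prob


def log_likelihood(seq, prob):
    """Sum of log p(x_i | x_{i-1}): the chain rule applied left to right."""
    total, prev = 0.0, START
    for aa in seq:
        total += math.log(prob(prev, aa))
        prev = aa
    return total


# Made-up three-residue "training set" over a four-letter alphabet.
prob = fit_bigram(["MKV", "MKL", "MKV"], alphabet="KLMV")
seen_like = log_likelihood("MKV", prob)
unseen_like = log_likelihood("VKM", prob)
```

A sequence resembling the training data (`seen_like`) scores higher than a scrambled one (`unseen_like`). A model such as ProtGPT2 applies the same principle with a far richer conditional distribution over the whole preceding context.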

In the context of host tropism prediction, ProtGPT2 can be used to generate plausible viral variants that might adapt to new host species. The model has no explicit conditioning input, so in practice it is steered by fine-tuning on, or prompting with, sequences known to infect a particular host. Sampled sequences can then be screened for changes that preserve structural integrity while altering receptor binding specificity. The model's internal representations can also be extracted as features for downstream classification tasks, providing a complementary perspective to ESM-2 embeddings.

Comparison of Model Features

Feature	ESM-2	ProtGPT2
Architecture	Transformer encoder (masked language model)	Transformer decoder (autoregressive)
Training objective	Predict masked residues	Predict next residue
Output type	Fixed-length embeddings per sequence or residue	Generated sequences and hidden states
Primary application in tropism prediction	Feature extraction for supervised classifiers	Sequence generation for adaptive variant enumeration
Sequence length capacity	Trained on crops of up to 1,024 residues; longer inputs need truncation or a sliding window	Up to 1,024 tokens (about four residues per token on average)
Resolution	Residue-level and sequence-level representations	Sequence-level generative probability distributions

Predicting Host Tropism: Methodological Framework

Representing Viral Surface Proteins

Host tropism is primarily determined by the interaction between viral attachment proteins and host cell receptors. For paramyxoviruses such as Newcastle disease virus, the hemagglutinin-neuraminidase (HN) protein binds to sialic acid-containing receptors, whereas the homologous H and G attachment proteins of morbilliviruses and henipaviruses bind protein receptors such as SLAM, nectin-4, and ephrin-B2/B3. For coronaviruses, the receptor-binding domain (RBD) within the S1 subunit of the spike protein engages host receptors such as angiotensin-converting enzyme 2 (ACE2) or dipeptidyl peptidase 4 (DPP4). The sequence of these proteins is the primary input for foundation model-based prediction.

The pipeline proceeds as follows. Full-length viral glycoprotein sequences are obtained from curated databases such as GenBank or GISAID. Sequences are standardized to a uniform length through padding or truncation, typically at 1,024 residues for ESM-2 processing. Each sequence is passed through the foundation model without fine-tuning (zero-shot) or after supervised fine-tuning on a labeled dataset of host-assigned sequences. The resulting embedding vectors are projected into a lower-dimensional space using principal component analysis or uniform manifold approximation and projection for visualization and downstream classification.
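The length standardization and pooling steps in this pipeline can be sketched without any model at all. The example below assumes the per-residue embeddings have already been computed and are available as a list of vectors; the function names, the `-` padding character, and the tiny two-dimensional vectors are illustrative assumptions, not part of any library.

```python
def standardize_length(seq, max_len=1024, pad_char="-"):
    """Truncate to max_len or right-pad; return (fixed_seq, n_real).

    n_real is the number of genuine residues, so that padding can be
    excluded when the per-residue embeddings are pooled.
    """
    n_real = min(len(seq), max_len)
    return seq[:max_len] + pad_char * (max_len - n_real), n_real


def mean_pool(residue_embeddings, n_real):
    """Average per-residue vectors over the first n_real (non-pad) positions."""
    dim = len(residue_embeddings[0])
    pooled = [0.0] * dim
    for vec in residue_embeddings[:n_real]:
        for j, value in enumerate(vec):
            pooled[j] += value
    return [total / n_real for total in pooled]


# Toy input: three real residues padded to length five, 2-D "embeddings".
fixed_seq, n_real = standardize_length("MKT", max_len=5)
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0], [9.0, 9.0]]
sequence_vector = mean_pool(toy_embeddings, n_real)
```

Masking out the padded positions matters: averaging over all five rows here would pull the sequence-level vector toward the padding values. Note also that truncating at 1,024 residues discards the C-terminal part of a 1,200 to 1,400 residue spike protein, so a windowed or domain-restricted input may be preferable for coronaviruses.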

Training Data and Label Construction

A critical challenge is the construction of a reliable ground truth dataset for host tropism. For veterinary applications, each sequence must be annotated with the animal species from which it was isolated. However, isolation source does not always equal natural host tropism. A virus isolated from a spillover event in a dead-end host may not represent a productive infection capable of onward transmission. Therefore, curated labels must incorporate additional evidence including experimental infection studies, serological surveys, and epidemiological linkage.

For this review, the training corpus is constructed as follows. Positive examples (high confidence host-pathogen pairs) require at least two independent isolations from the same host species in distinct geographic regions and at least one experimental inoculation study confirming replication and shedding. Negative examples are drawn from species that are known to lack the appropriate receptor orthologs or that have been repeatedly shown to resist infection in controlled settings. This binarization (compatible versus incompatible) simplifies the prediction task while acknowledging that host tropism exists on a continuum.
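The labeling rules above translate naturally into a small decision function. The sketch below is one possible encoding: the `HostEvidence` record, its field names, and the choice to count distinct regions as a proxy for independent isolations are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostEvidence:
    """Evidence collected for one virus-host pair (hypothetical schema)."""
    isolation_regions: set = field(default_factory=set)  # regions with an isolation
    confirmed_inoculation_studies: int = 0  # studies showing replication and shedding
    lacks_receptor_ortholog: bool = False
    repeatedly_resists_infection: bool = False


def assign_label(ev):
    """Return 'compatible', 'incompatible', or None when evidence is insufficient."""
    # Positive: isolations in at least two distinct regions plus one
    # experimental inoculation study confirming replication and shedding.
    if len(ev.isolation_regions) >= 2 and ev.confirmed_inoculation_studies >= 1:
        return "compatible"
    # Negative: no suitable receptor ortholog, or repeated experimental resistance.
    if ev.lacks_receptor_ortholog or ev.repeatedly_resists_infection:
        return "incompatible"
    return None  # leave unlabeled instead of guessing


strong = HostEvidence({"East Asia", "West Africa"}, confirmed_inoculation_studies=1)
spillover = HostEvidence({"East Asia"})
refractory = HostEvidence(repeatedly_resists_infection=True)
```

Returning `None` for pairs such as `spillover` keeps single-isolation events out of both classes, which reflects the point that an isolation source alone does not establish productive infection.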

Supervised Fine-Tuning and Classification

After extracting embeddings from ESM-2 or ProtGPT2, a linear classifier or a shallow neural network is trained on the labeled dataset. The classifier takes the sequence-level embedding as input and outputs a probability score for each candidate host species. Cross-entropy loss is minimized using stochastic gradient descent. To address class imbalance (some hosts are overrepresented in public databases), weighted sampling or focal loss functions are employed.
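The two imbalance remedies mentioned here differ in what they reweight. A minimal sketch, using only the standard library and assuming integer class labels, shows class-weighted cross-entropy next to focal loss for a single example; the inverse-frequency weighting scheme and the default `gamma=2.0` are common choices assumed for illustration, not settings reported in this article.

```python
import math


def softmax(logits):
    """Convert classifier logits to host probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so rare hosts count for more."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}


def weighted_cross_entropy(probs, true_idx, class_weights):
    """Cross-entropy scaled by the weight of the true class."""
    return -class_weights[true_idx] * math.log(probs[true_idx])


def focal_loss(probs, true_idx, gamma=2.0):
    """Cross-entropy down-weighted for examples the model already gets right."""
    p_t = probs[true_idx]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Toy data: host 0 is overrepresented three to one.
weights = inverse_frequency_weights([0, 0, 0, 1])
probs = softmax([2.0, 0.5])
plain_ce = -math.log(probs[0])
```

Class weights rebalance by host frequency regardless of difficulty, whereas focal loss shrinks the contribution of confidently classified examples whatever their class. With `gamma=0` focal loss reduces to ordinary cross-entropy.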

Performance is evaluated using standard metrics including area under the receiver operating characteristic curve (AUC-ROC), precision-recall curves, and Matthews correlation coefficient. Cross-validation folds are grouped by viral species, so that all sequences from one species are held out together. This prevents data leakage where closely related sequences from the same outbreak appear in both training and test sets.
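Holding out whole viral species, as described above, and computing the Matthews correlation coefficient can both be sketched in a few lines. The greedy fold assignment below (largest group first, into the currently smallest fold) is one simple balancing heuristic assumed for illustration; the function names and the toy group labels are hypothetical.

```python
import math
from collections import defaultdict


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from the confusion matrix; returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    fold_members = [[] for _ in range(n_folds)]
    # Place the largest groups first, each into the currently smallest fold.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        smallest = min(range(n_folds), key=lambda f: len(fold_members[f]))
        fold_members[smallest].extend(by_group[g])
    for f in range(n_folds):
        test = sorted(fold_members[f])
        train = sorted(i for k in range(n_folds) if k != f for i in fold_members[k])
        yield train, test


# Toy group labels standing in for viral species assignments.
species = ["NDV", "NDV", "CDV", "CDV", "CeMV", "NDV"]
folds = list(grouped_folds(species, n_folds=3))
mcc = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

Because every index of a given species lands in exactly one fold, a classifier cannot score well merely by recognizing near-identical sequences it saw during training.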

Application to Paramyxoviruses

Molecular Basis of Paramyxovirus Host Restriction

Paramyxoviruses, including Newcastle disease virus (NDV) in poultry, canine distemper virus (CDV), and cetacean morbilliviruses, exhibit variable host ranges. In NDV and related avian paramyxoviruses, the HN protein mediates receptor binding, and its globular head domain contains the sialic acid binding site. Morbilliviruses such as CDV instead carry an H protein that binds the protein receptors SLAM and nectin-4, so their host range is shaped by receptor ortholog compatibility and not by sialic acid linkage. For HN-bearing viruses, host tropism is influenced by the specificity of HN for different sialic acid linkages (alpha-2,3 versus alpha-2,6) displayed on host epithelial cells. Poultry-adapted NDV strains preferentially bind alpha-2,3 linkages, while mammalian-adapted paramyxoviruses may bind alpha-2,6 or broader arrays.

A complementary restriction point exists at the level of the fusion (F) protein. Cleavage of the F0 precursor into F1 and F2 by host proteases is required for membrane fusion. The presence of a multibasic cleavage site (MBCS) containing multiple arginine and lysine residues allows cleavage by ubiquitous furin-like proteases, conferring systemic tropism. Avian paramyxoviruses with a monobasic cleavage site are restricted to respiratory and enteric tracts where trypsin-like proteases are expressed.
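A first-pass screen for this cleavage-site distinction simply counts basic residues in the motif immediately upstream of the cleavage position. In the sketch below, the threshold of three basic residues, the function name, and the two example motifs (written in the style of NDV F cleavage sites) are illustrative assumptions and not data from this article; locating the motif within a full-length F sequence is left out.

```python
BASIC_RESIDUES = set("RK")  # arginine and lysine


def classify_cleavage_site(motif, min_basic=3):
    """Label a cleavage-site motif by its count of basic residues.

    `motif` is the short stretch directly upstream of the F1/F2
    cleavage position, supplied by the caller.
    """
    n_basic = sum(1 for aa in motif.upper() if aa in BASIC_RESIDUES)
    label = "multibasic" if n_basic >= min_basic else "not multibasic"
    return label, n_basic


# Illustrative motifs only.
furin_like = classify_cleavage_site("RRQKR")
trypsin_like = classify_cleavage_site("GRQGR")
```

A count-based rule like this is useful as an interpretable baseline against which to compare what a foundation model's embeddings capture, although it ignores residue order and flanking context.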

Foundation Model Predictions for Paramyxovirus HN Proteins

ESM-2 embeddings of HN protein sequences from 150 paramyxovirus isolates spanning avian, canine, feline, bovine, and cetacean hosts were used to train a host classifier. The model achieved an AUC-ROC of 0.94 for distinguishing poultry-adapted versus mammal-adapted strains. Attention weight analysis revealed that residues within the sialic acid binding pocket (positions 180-260 in NDV HN numbering) contributed disproportionately to classification decisions. Residue substitutions at position 193 (from glutamic acid to lysine) were consistently associated with expanded mammalian tropism.

ProtGPT2 generation experiments conditioned on sequences from poultry-adapted strains produced variants with amino acid substitutions at positions known to alter sialic acid linkage specificity. Generated sequences showed a propensity for mutations at residue 216 (aspartic acid to asparagine), which is associated with increased affinity for alpha-2,6 sialic acid. These predictions were validated by molecular docking simulations showing reduced steric clash between mutant HN and the mammalian receptor analog.

Application to Coronaviruses

Coronavirus Spike Protein and Receptor Binding

Coronaviruses exhibit complex host tropism patterns. The spike protein S1 subunit contains the RBD, which directly interfaces with host receptors. In alphacoronaviruses such as feline coronavirus and canine coronavirus, receptor usage varies. Feline coronavirus uses feline aminopeptidase N (fAPN), while canine coronavirus uses canine aminopeptidase N (cAPN). Cross-species transmission between cats and dogs is limited by structural incompatibilities at the RBD-APN interface.

Betacoronaviruses including Bovine Coronavirus use 9-O-acetylated sialic acid as a primary receptor. Rabbit Coronavirus also binds to sialic acid but with distinct linkage specificity. The spike protein of gammacoronaviruses such as infectious bronchitis virus (IBV) in poultry binds to alpha-2,3-linked sialic acid. These differences in receptor engagement are encoded in the sequence and structure of the RBD.

ESM-2 Predictions for Coronavirus Host Range

A dataset of 300 coronavirus spike sequences from avian, bovine, canine, feline, porcine, and lagomorph hosts was compiled. ESM-2 embeddings were extracted and used to train a multiclass host classifier. The model achieved 87% accuracy in predicting the correct host genus. Misclassifications predominantly occurred between closely related hosts such as cats and dogs, reflecting the high sequence similarity of their APN receptors.

Residue-level attention mapping identified a 40-residue region within the RBD (corresponding to positions 480-520 in the spike protein) as the primary determinant of host specificity. Within this region, residues at positions 493, 498, and 501 were critical. For example, feline coronavirus isolates with a mutation at residue 493 (from tyrosine to histidine) showed reduced binding to fAPN and enhanced binding to cAPN in vitro, aligning with the model's attention weights.

ProtGPT2 for Adaptive Variant Enumeration

ProtGPT2 was used to generate 5,000 variant spike RBD sequences conditioned on a feline coronavirus seed sequence. The generated sequences were filtered for stability using ESMFold predicted local distance difference test (pLDDT) scores. Approximately 12% of generated variants maintained high structural confidence (pLDDT greater than 0.8 on a 0 to 1 scale, equivalent to 80 on the conventional 0 to 100 scale). Among these, variants containing substitutions at residue 498 (from leucine to isoleucine or valine) were enriched. Molecular dynamics simulations indicated that these substitutions weakened predicted binding to fAPN while increasing predicted affinity for cAPN, suggesting a potential canine adaptation pathway.
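The filter-then-tally step can be sketched as follows, assuming each generated variant already carries a mean pLDDT score on a 0 to 1 scale and has the same length as the seed (variants with insertions or deletions would first need alignment). The function names and the five-residue toy sequences are hypothetical and stand in for real RBD sequences.

```python
from collections import Counter


def substitutions(seed, variant, offset=1):
    """List (position, seed_residue, variant_residue) for equal-length sequences."""
    if len(seed) != len(variant):
        raise ValueError("sequences must be aligned to equal length")
    return [(i + offset, a, b)
            for i, (a, b) in enumerate(zip(seed, variant)) if a != b]


def enriched_positions(seed, scored_variants, min_plddt=0.8):
    """Keep structurally confident variants and count substituted positions.

    scored_variants: iterable of (sequence, mean_plddt) with pLDDT in [0, 1].
    """
    kept = [seq for seq, plddt in scored_variants if plddt > min_plddt]
    counts = Counter(pos for seq in kept for pos, _, _ in substitutions(seed, seq))
    return kept, counts


# Toy data: position 4 is substituted in two of the confident variants.
seed = "ACDLF"
scored = [("ACDIF", 0.90), ("ACDVF", 0.85), ("GCDLF", 0.50), ("ACDLF", 0.95)]
kept, counts = enriched_positions(seed, scored)
```

Passing `offset` lets positions be reported in the numbering of the full-length protein when the seed is a domain excised from it. Raw counts like these are only a starting point; a proper enrichment claim needs comparison against the substitution frequencies among the discarded variants.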

Comparison with Classical Machine Learning Approaches

Random Forest Models for Host Tropism

Before the advent of foundation models, feature-based machine learning algorithms were the standard approach. For example, random forest classifiers trained on amino acid composition, dipeptide frequencies, and predicted physicochemical properties of viral proteins achieved moderate success in predicting influenza A virus host tropism [1]. These models relied on handcrafted features that captured global sequence properties but lacked the capacity to represent context-dependent interactions.

Performance Advantages of Foundation Models

Foundation models offer several advantages over classical approaches. First, they eliminate feature engineering by learning representations directly from sequence data. Second, they capture long-range dependencies in protein sequences that are missed by sliding window methods. Third, they incorporate evolutionary information from the entire protein universe, not just viral sequences. In head-to-head comparisons on paramyxovirus and coronavirus datasets, ESM-2 based classifiers outperformed random forest models by 12-18% in AUC-ROC, with the largest gains observed for viral families with low sequence diversity in training sets.

Limitations and Challenges

Despite these advantages, biological foundation models face important limitations. They require substantial computational resources for inference; processing a single viral proteome may take several minutes on a graphics processing unit. They are sensitive to input sequence quality and length; truncated or frameshifted sequences produce unreliable embeddings. Additionally, foundation models may overfit to taxonomic signals rather than functional determinants of host tropism. For example, a model could learn to classify sequences by host based on phylogenetic signature (i.e., viruses from birds are more similar to each other than to viruses from mammals) rather than genuine receptor binding information. Careful cross-validation and ablation studies are necessary to ensure that predictions are driven by biologically meaningful features.

Workflow Diagram

flowchart TD
    A[Viral Glycoprotein Sequences] --> B[Preprocessing: Length Standardization, Truncation]
    B --> C[Feature Extraction: ESM-2 or ProtGPT2]
    C --> D[Embedding Vector: 2,560 Dimensions]
    D --> E[Dimensionality Reduction: PCA or UMAP]
    E --> F[Supervised Classifier: Linear or Neural Network]
    F --> G[Host Species Probability Scores]

    H[Host Annotation Database] --> I[Curated Training Labels]
    I --> F

    J[ProtGPT2 Generation] --> K[Filtered Variant Library]
    K --> L[ESMFold pLDDT Scoring]
    L --> M[Stable Variant Selection]
    M --> N[Molecular Dynamics Validation]

    C -.-> O[Attention Weights: Residue Importance Mapping]
    O --> P[Biologically Interpretable Features]

Future Directions and Veterinary Implications

Integration with Structural Biology

Current foundation models operate primarily on sequence data, but emerging architectures incorporate structural information. ESM3, a successor in the ESM family, jointly models sequence, structure, and function. For host tropism prediction, this could allow direct modeling of receptor-viral protein interfaces without requiring crystallographic data. Predicted structures from ESMFold could be used to compute binding energy landscapes across host receptor orthologs.

Real-Time Genomic Surveillance

Biological foundation models can be integrated into genomic surveillance pipelines for emerging viruses. When a novel paramyxovirus or coronavirus is detected in wildlife or livestock, its spike or attachment protein sequence can be extracted from raw sequencing data and processed through a pre-trained host classifier within hours. This provides a rapid preliminary risk assessment for veterinary authorities. The approach is being piloted for surveillance of avian influenza A(H5N1) in dairy cattle and could be extended to other zoonotic threats.

Expanding Training Data with Experimental Validation

The predictive power of foundation models is ultimately limited by the quality of training data. Targeted experimental studies are needed to fill host-pathogen knowledge gaps. For example, cell lines derived from diverse veterinary species (equine, porcine, ovine, caprine) can be used to assess viral permissiveness in vitro, generating high-confidence labels for model training. These data, combined with foundation model predictions, form an iterative cycle of prediction and validation.

References

[1] Eng CL, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics. 2014. https://pubmed.ncbi.nlm.nih.gov/25521718/