Issue #36: Scalable embedding fusion with protein language models: insights from benchmarking text-integrated representations.
Protein Design Digest - 2026-01-30

Daily curated signals from arXiv, PubMed, and bioRxiv.
Signal of the Day
Scalable embedding fusion with protein language models: insights from benchmarking text-integrated representations.
Protein language models (pLMs) have become essential tools in computational biology, powering diverse applications from variant effect prediction to protein engineering. Central to their success is the use of pretrained embeddings (contextualized representations of amino acid sequences), which enable effective transfer learning, especially in data-scarce settings. However, recent studies have revealed that standard masked language modeling objectives used to train these models often produce representations that are misaligned with the needs of downstream tasks. While scaling up model size improves performance in some cases, it does not universally yield better representations. In this study, we investigate two complementary strategies for improving pLM representations: (i) integrating text annotations through contrastive learning, and (ii) combining multiple embeddings via embedding fusion. We benchmark six text-integrated pLMs (tpLMs) and three large-scale pLMs across six biologically diverse tasks, showing that no single model dominates across settings. Fusion of multiple tpLM embeddings improves performance on most tasks but presents a computational bottleneck due to the combinatorial number of possible combinations. To overcome this, we propose greedier forward selection, a linear-time algorithm that efficiently identifies near-optimal embedding subsets. We validate its utility through two case studies, homologous sequence recovery and protein-protein interaction prediction, demonstrating new state-of-the-art results in both. Our work highlights embedding fusion as a practical and scalable strategy for improving protein representations.
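Why this matters: No single pLM wins on every task, so fusing a well-chosen subset of existing embeddings offers a practical, scalable way to improve protein representations without training a larger model.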
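The abstract does not spell out how greedier forward selection works, so the sketch below is not the authors' algorithm. It only illustrates the trade-off the abstract describes: n embeddings give 2^n - 1 non-empty subsets, classic greedy forward selection needs on the order of n^2 score evaluations, and a single-pass variant (one hypothetical way to reach linear time) needs 2n - 1. Fusion by concatenation, the function names, and the toy scoring data are all assumptions made for illustration; in practice `score` would train and validate a downstream predictor on the fused embeddings.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
ScoreFn = Callable[[List[str]], float]  # subset of model names -> validation score


def fuse(embeddings: Dict[str, List[Vector]], names: Sequence[str]) -> List[Vector]:
    """Fuse by concatenating, per protein, the vectors of the chosen models.

    Concatenation is an assumption; the abstract does not name the fusion operator.
    """
    n_proteins = len(embeddings[names[0]])
    return [[x for name in names for x in embeddings[name][i]] for i in range(n_proteins)]


def greedy_forward_selection(names: Sequence[str], score: ScoreFn) -> Tuple[List[str], float]:
    """Classic forward selection: each round adds the model that raises the
    score the most, and stops when no addition helps (O(n^2) evaluations)."""
    selected: List[str] = []
    best = float("-inf")
    remaining = list(names)
    while remaining:
        cand_score, cand = max((score(selected + [n]), n) for n in remaining)
        if cand_score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


def single_pass_selection(names: Sequence[str], score: ScoreFn) -> Tuple[List[str], float]:
    """Hypothetical linear-time variant: rank models by solo score, then walk
    the ranking once and keep each model only if it improves the fused score."""
    ranked = sorted(names, key=lambda n: score([n]), reverse=True)
    if not ranked:
        return [], float("-inf")
    selected = [ranked[0]]
    best = score(selected)
    for name in ranked[1:]:
        trial = score(selected + [name])
        if trial > best:
            selected.append(name)
            best = trial
    return selected, best


# Toy stand-in for a real benchmark: each (made-up) model "covers" some
# signals, and every extra model costs a small penalty.
TOY_SIGNAL = {"model_a": {1, 2, 3}, "model_b": {3, 4}, "model_c": {1, 2}}


def toy_score(subset: List[str]) -> float:
    covered = set().union(*(TOY_SIGNAL[name] for name in subset))
    return len(covered) - 0.1 * len(subset)


if __name__ == "__main__":
    print(greedy_forward_selection(list(TOY_SIGNAL), toy_score))
    print(single_pass_selection(list(TOY_SIGNAL), toy_score))
```

On this toy data both routines select the same pair of models, but that is a property of the example rather than a guarantee: a single pass can miss combinations that the classic procedure finds.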
Also Worth Reading
Decrypting potential mechanisms linking ochratoxin A to hepatocellular carcinoma: an integrated approach combining toxicology, machine learning, molecular docking, and molecular dynamics simulation.
Background: Ochratoxin A (OTA), a common food-borne mycotoxin, is a potential human carcinogen, yet the specific molecular mechanisms linking it to hepatocellular carcinoma (HCC) remain unclear. Methods: We integrated network toxicology to predict OTA targets and intersected them with HCC transcriptomic data to identify key candidate genes. Functional enrichment analysis was then conducted. Multiple machine learning algorithms were applied to screen and validate core genes. Furthermore, molecular docking and molecular dynamics (MD) simulations were employed to evaluate the binding stability between OTA and key target proteins. Results: A total of 50 key genes were identified as potential targets in OTA-associated hepatocarcinogenesis. Enrichment analysis revealed their significant involvement in critical processes such as xenobiotic metabolism and oxidative stress response. Machine learning analysis prioritized eight core genes (AURKA, GABARAPL1, CA2, PARP1, LMNA, SLC27A5, EPHX2, and GSTP1), and a combined diagnostic model demonstrated outstanding performance (AUC = 0.986). Structural analyses via molecular docking and MD simulations confirmed stable binding interactions between OTA and these core targets. Conclusions: This integrated computational study identifies a set of candidate genes through which OTA may interact with HCC-associated molecular networks. The robust binding predicted between OTA and the core targets provides a structural basis for these interactions. These findings offer a prioritized list of targets and a theoretical framework for subsequent experimental validation and investigation into OTA’s toxicological role in HCC.
Study on the Mechanism of Ku Diding in the Treatment of Diabetes based on Network Pharmacology, Molecular Docking Technology, and Molecular Dynamics.
Introduction: To explore how Ku Diding (KDD) works in managing Diabetes Mellitus (DM), researchers utilized network pharmacology, molecular docking, and molecular dynamics methodologies. Methods: Key active components of KDD were identified using the Traditional Chinese Medicine Systematic Pharmacology Database and Analysis Platform (TCMSP). Data for diabetes-related targets were retrieved from the Human Genetic Comprehensive Databases (GeneCards) and the Online Mendelian Inheritance in Man (OMIM) database. The intersection of these targets was analyzed to determine potential therapeutic targets for diabetes treatment. Protein-protein interaction (PPI) networks were constructed using the STRING database and Cytoscape software, followed by Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Molecular docking between the components and key targets was performed using the AutoDock Vina platform. Results: This study identified Dihydrosanguinarine and (S)-Scoulerine, among others, as the main active ingredients of KDD for treating DM, showing high affinity for critical targets like PTGS2 and PRKACA and acting through multiple pathways including vascular regulation, neuromodulation, metabolic regulation, and endocrine regulation. The molecular docking results showed interactions between the active ingredients and the key targets, with the majority of the effective components exhibiting a stronger binding affinity than Metformin. Among them, (S)-Scoulerine and Dihydrosanguinarine demonstrated high docking affinity with the key target proteins PTGS2 and PRKACA. Discussion: DM is closely linked to oxidative stress, chronic inflammation, and insulin signaling dysregulation. This study reveals that KDD exerts anti-diabetic effects via a multi-target network involving proteins such as PRKACA, PTGS2, ESR1, FOS, and DRD2. These targets are associated with glucose metabolism, inflammation, oxidative stress, and neural regulation. Modulation of these pathways likely enhances insulin sensitivity, lowers blood glucose, suppresses inflammation, and protects against oxidative damage. GO and KEGG analyses further indicate involvement in MAPK signaling, synaptic transmission, and vascular regulation, forming a multidimensional “metabolism-inflammation-neural” regulatory network. Compared to Metformin, most KDD-derived compounds showed stronger binding, highlighting their therapeutic potential. Molecular dynamics simulations support the stability of the observed binding conformations, suggesting their potential as therapeutic targets. These findings underscore KDD’s ability to simultaneously target multiple pathological mechanisms, offering a holistic treatment strategy for DM. Conclusion: This study provides preliminary evidence that KDD is characterized by a multicomponent, multi-target, and multi-pathway approach in the treatment of diabetes mellitus (DM), thereby establishing a scientific foundation for further in-depth exploration of KDD’s molecular mechanisms.
Unraveling the Interplay of D-2-HG in Glioblastoma Tumorigenesis via Integrated Machine Learning and Molecular Docking Analysis.
Glioblastoma (GBM) is an exceptionally aggressive type of brain tumor with a poor prognosis, underscoring the urgent need to identify new molecular targets for therapeutic development. The objective of this research is to clarify the molecular interactions affected by the oncometabolite D-2-hydroxyglutarate (D-2-HG) within the framework of GBM. Differential expression analysis of multi-omics data identified potential target genes linked to GBM pathogenesis. To enhance our understanding of the binding interactions between D-2-HG and the identified target proteins, we utilized an integrated methodology encompassing various machine learning algorithms, network pharmacology techniques, and molecular docking. A total of 135 genes were identified as possible targets through which D-2-HG exerts its effects in GBM. The subsequent machine learning analysis identified six crucial genes [eukaryotic translation initiation factor 4E binding protein 1 (EIF4EBP1), fatty acid binding protein 3 (FABP3), potassium voltage-gated channel subfamily Q member 2 (KCNQ2), epithelial cell adhesion molecule (EPCAM), sphingosine-1-phosphate receptor 5 (S1PR5), and metabotropic glutamate receptor 3 (GRM3)] as key regulators. Among these, FABP3, KCNQ2, EPCAM, S1PR5, and GRM3 were significantly downregulated, whereas EIF4EBP1 was markedly upregulated (p < 0.05). Molecular docking simulations indicated a strong binding affinity of D-2-HG towards the target proteins. Our study suggests that D-2-HG plays a significant role in the pathogenesis of GBM by modulating specific genes and signaling pathways. Utilizing machine learning techniques, we identified six essential regulatory genes, and further molecular docking simulations revealed a strong affinity of D-2-HG for these critical targets. Collectively, these results establish a substantial basis for future investigations into the mechanistic role of D-2-HG in GBM oncogenesis.
Research & AI Updates
- With AlphaGenome, Researchers Are Using A.I. to Decode the Human Blueprint - The New York Times
- Heme Biosynthesis is controlled by reversible feedback mechanism inside the mitochondrial matrix - Vanderbilt University
- New AI model predicts gene function in DNA’s vast ‘dark genome’ - 동아사이언스
- ATP-Sensitive Peptide-Based Coacervates for Intracellular Delivery of Therapeutic Oligonucleotides - Frontiers
- DeepMind’s AlphaGenome Predicts Genetic Variation Function, Including Disease - Genetic Engineering and Biotechnology News
From the Industry
- J&K Ingredients, Pallas Biotech partnership spurs clean-label innovation - Commercial Baking
- Biologics Contract Research Organizations Market Trends Analysis and Forecast Report 2021-2025 & 2025-2033 - GlobeNewswire
- Summit Therapeutics Announces U.S. FDA Acceptance of Biologics License Application (BLA) Seeking Approval for Ivonescimab in Combination with Chemotherapy in Treatment of Patients with EGFRm NSCLC Post-TKI Therapy - Business Wire
- United States Red Biotechnology Market to Grow at 5.2% CAGR - openPR.com
- Hangzhou Jiuge Biotech advances membrane protein production with P2X1 antibody P007 - EU Reporter
- Pinnacle Food Group Subsidiary Launches 18-Month Biotech Collaboration on Precision Fermentation - TipRanks
- China’s edge in early-stage drugmaking ‘likely to persist,’ Pitchbook says - BioPharma Dive
Quick Reads
Identification of novel umami peptides in fermented milk and elucidation of their umami mechanism via molecular docking and molecular dynamics simulations.
A streamlined workflow integrating multi-model machine learning, bioinformatics filtering, sensory evaluation, molecular docking and dynamics simulations was applied to mine umami peptides in fermented milk.
Identification of Three Novel Umami Peptides from Metagenomics of Traditional Fermented Fish, Suanyu, and Receptor Binding Mechanism via the Graph Neural Network-Based Model and Molecular Dynamics Simulation.
Fermented fish products are vital sources of umami peptides.
Triazinyl-benzenesulfonamide derivatives as hCA IX inhibitors: Design, synthesis, and activity determination using an optimized stop-flow methodology.
This study presents the design, synthesis, and biological evaluation of a new series of 1,3,5-triazinyl benzenesulfonamide derivatives incorporating substituted piperazines, aminobenzenes, or adamantane moieties.
Influence of drying temperature on the metabolites profile and potential antioxidant pathways of Passiflora edulis peel: Integrating untargeted metabolomics with network pharmacology analyses, molecular docking, and molecular dynamics simulation.
Passiflora edulis peels have considerable antioxidative potential, which is attributed to their diverse bioactive components.
Ginsenoside Rb1 as a multi-target modulator in heart failure: Mechanistic insights into extracellular remodeling and transcriptional pathways from network pharmacology, molecular dynamics, and binding free energy analyses.
Background: Heart failure is a leading global health burden, often driven by Angiotensin II (Ang II)-induced processes such as inflammation, fibrosis, and extracellular matrix remodeling.
A Mini Review on Metal Complexes as Potential Anti-SARS-CoV-2 Agents: Insights from Molecular Docking Studies.
There is an urgent need to develop effective antiviral treatments against SARS-CoV-2.
Comparative Analysis of AlphaFold2 Models and Intrinsic Disorder Illuminates Structural Divergence as a Symptom of Functional Divergence Across the Calmodulin Superfamily.
Protein structure enables function.
Mechanisms of Bellidifolin in Treating Doxorubicin-Induced Cardiotoxicity: Network Pharmacology, Molecular Docking, and Experimental Verification.
This study aims to examine the roles and mechanisms of action of bellidifolin (BEL) in alleviating doxorubicin-mediated cardiotoxicity using network pharmacology and experimental validation.
Pipeline Tip
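Convert WIG or bedGraph tracks to BigWig before visualization: the format carries its own index, so a genome browser reads only the region in view and uses less memory.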
Resources & Tools
- Dataset: BioLiP - Verified biologically relevant ligand-protein interactions.
- Dataset: SIFTS - Residue-level mapping between PDB, UniProt, and other resources.
- Tool: AlphaFill - Ligand and cofactor transfer into AlphaFold models.
- Tool: ReFOLD4 - Sophisticated protein structure refinement tool for improving model quality.
- Event: Protein Design Hub (LinkedIn Group) (Ongoing)
- Event: Structural Biology Events (Open)
- Job: Structures of nucleotide-bound human telomerase at several steps of its telomeric DNA repeat addition cycle - Nature at Nature Careers
- Job: Single-cell atlas of the transcriptome and chromatin accessibility in the human retina - Nature at Nature Careers
The protein structure is the language of life; design is its poetry. — Recep Adiyaman