Recep Adiyaman
bioinformatics

Issue #36: Scalable embedding fusion with protein language models: insights from benchmarking text-integrated representations.

January 30, 2026 Daily Intelligence
Protein Design Daily

🧬 Protein Design Digest

Curated protein signals by Recep Adiyaman

🚀 Today’s Top Signal

Scalable embedding fusion with protein language models: insights from benchmarking text-integrated representations.

🧬 Abstract

Protein language models (pLMs) have become essential tools in computational biology, powering diverse applications from variant effect prediction to protein engineering. Central to their success is the use of pretrained embeddings (contextualized representations of amino acid sequences), which enable effective transfer learning, especially in data-scarce settings. However, recent studies have revealed that standard masked language modeling objectives used to train these models often produce representations that are misaligned with the needs of downstream tasks. While scaling up model size improves performance in some cases, it does not universally yield better representations. In this study, we investigate two complementary strategies for improving pLM representations: (i) integrating text annotations through contrastive learning, and (ii) combining multiple embeddings via embedding fusion. We benchmark six text-integrated pLMs (tpLMs) and three large-scale pLMs across six biologically diverse tasks, showing that no single model dominates across settings. Fusion of multiple tpLM embeddings improves performance on most tasks but presents a computational bottleneck due to the combinatorial number of possible combinations. To overcome this, we propose greedier forward selection, a linear-time algorithm that efficiently identifies near-optimal embedding subsets. We validate its utility through two case studies, homologous sequence recovery and protein-protein interaction prediction, demonstrating new state-of-the-art results in both. Our work highlights embedding fusion as a practical and scalable strategy for improving protein representations.

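Why it matters: No single pLM wins across tasks, so fusing embeddings from several models, with a linear-time selection step to keep the search tractable, offers a practical route to better protein representations than scaling one model.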
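The abstract names embedding fusion and a linear-time greedy selection but does not spell out the procedure, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It assumes fusion means concatenating per-protein feature vectors, and that a caller-supplied `score` function rates an embedding on a downstream task. The names `fuse`, `greedy_single_pass`, `loo_1nn_accuracy`, the model names, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Embedding = List[List[float]]  # one feature row per protein


def fuse(embeddings: Sequence[Embedding]) -> Embedding:
    """Fuse embeddings by concatenating each protein's feature rows (assumed fusion rule)."""
    return [[x for row in rows for x in row] for rows in zip(*embeddings)]


def greedy_single_pass(
    candidates: Dict[str, Embedding],
    score: Callable[[Embedding], float],
) -> List[str]:
    """Rank models by solo score, then keep each one only if it improves the fused score.

    Uses about 2n score evaluations for n models, instead of scoring all 2^n subsets.
    """
    ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
    selected = [ranked[0]]
    best = score(candidates[ranked[0]])
    for name in ranked[1:]:
        trial = score(fuse([candidates[n] for n in selected + [name]]))
        if trial > best:  # strict improvement required to add a model
            selected.append(name)
            best = trial
    return selected


def loo_1nn_accuracy(labels: List[int]) -> Callable[[Embedding], float]:
    """Toy stand-in for a downstream benchmark: leave-one-out 1-nearest-neighbour accuracy."""
    def score(emb: Embedding) -> float:
        correct = 0
        for i, x in enumerate(emb):
            others = [j for j in range(len(emb)) if j != i]
            nearest = min(others, key=lambda j: sum((a - b) ** 2 for a, b in zip(x, emb[j])))
            correct += labels[nearest] == labels[i]
        return correct / len(emb)
    return score


# Hypothetical toy data: four proteins, two classes, three one-dimensional "models".
labels = [0, 0, 1, 1]
toy_models = {
    "model_a": [[0.0], [0.1], [0.3], [1.0]],
    "model_b": [[0.0], [1.0], [3.0], [1.95]],
    "model_c": [[0.0], [5.0], [0.1], [5.1]],
}
score = loo_1nn_accuracy(labels)
selected = greedy_single_pass(toy_models, score)
print(selected)
```

In this toy setup neither `model_a` nor `model_b` classifies every protein correctly on its own, but their concatenation does, so the pass keeps both and skips `model_c`.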


⭐ Additional Signals

Decrypting potential mechanisms linking ochratoxin A to hepatocellular carcinoma: an integrated approach combining toxicology, machine learning, molecular docking, and molecular dynamics simulation.

Background: Ochratoxin A (OTA), a common food-borne mycotoxin, is a potential human carcinogen, yet the specific molecular mechanisms linking it to hepatocellular carcinoma (HCC) remain unclear. Methods: We integrated network toxicology to predict OTA targets and intersected them with HCC transcriptomic data to identify key candidate genes. Functional enrichment analysis was then conducted. Multiple machine learning algorithms were applied to screen and validate core genes. Furthermore, molecular docking and molecular dynamics (MD) simulations were employed to evaluate the binding stability between OTA and key target proteins. Results: A total of 50 key genes were identified as potential targets in OTA-associated hepatocarcinogenesis. Enrichment analysis revealed their significant involvement in critical processes such as xenobiotic metabolism and oxidative stress response. Machine learning analysis prioritized eight core genes (AURKA, GABARAPL1, CA2, PARP1, LMNA, SLC27A5, EPHX2, and GSTP1), and a combined diagnostic model demonstrated outstanding performance (AUC = 0.986). Structural analyses via molecular docking and MD simulations confirmed stable binding interactions between OTA and these core targets. Conclusions: This integrated computational study identifies a set of candidate genes through which OTA may interact with HCC-associated molecular networks. The robust binding predicted between OTA and the core targets provides a structural basis for these interactions. These findings offer a prioritized list of targets and a theoretical framework for subsequent experimental validation and investigation into OTA’s toxicological role in HCC.

Study on the Mechanism of Ku Diding in the Treatment of Diabetes based on Network Pharmacology, Molecular Docking Technology, and Molecular Dynamics.

Introduction: To explore how Ku Diding (KDD) works in managing Diabetes Mellitus (DM), researchers utilized network pharmacology, molecular docking, and molecular dynamics methodologies. Methods: Key active components of KDD were identified using the Traditional Chinese Medicine Systematic Pharmacology Database and Analysis Platform (TCMSP). Data for diabetes-related targets were retrieved from the Human Genetic Comprehensive Databases (GeneCards) and the Online Mendelian Inheritance in Man (OMIM) database. The intersection of these targets was analyzed to determine potential therapeutic targets for diabetes treatment. Protein-protein interaction (PPI) networks were constructed using the STRING database and Cytoscape software, followed by Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Molecular docking between the components and key targets was performed using the AutoDock Vina platform. Results: This study identified Dihydrosanguinarine and (S)-Scoulerine, among others, as the main active ingredients of KDD for treating DM, showing high affinity for critical targets like PTGS2 and PRKACA and acting through multiple pathways including vascular regulation, neuromodulation, metabolic regulation, and endocrine regulation. The molecular docking results showed that there are interactions between the active ingredients and the key targets, with the majority of the effective components exhibiting a stronger binding affinity than Metformin. Among them, (S)-Scoulerine and Dihydrosanguinarine demonstrated high docking affinity with the key target proteins PTGS2 and PRKACA. Discussion: DM is closely linked to oxidative stress, chronic inflammation, and insulin signaling dysregulation. This study reveals that KDD exerts anti-diabetic effects via a multi-target network involving proteins such as PRKACA, PTGS2, ESR1, FOS, and DRD2. These targets are associated with glucose metabolism, inflammation, oxidative stress, and neural regulation. Modulation of these pathways likely enhances insulin sensitivity, lowers blood glucose, suppresses inflammation, and protects against oxidative damage. GO and KEGG analyses further indicate involvement in MAPK signaling, synaptic transmission, and vascular regulation, forming a multidimensional “metabolism-inflammation-neural” regulatory network. Compared to Metformin, most KDD-derived compounds showed stronger binding, highlighting their therapeutic potential. Molecular dynamics simulations support the stability of the observed binding conformations, suggesting their potential as therapeutic targets. These findings underscore KDD’s ability to simultaneously target multiple pathological mechanisms, offering a holistic treatment strategy for DM. Conclusion: This study provides preliminary evidence that KDD is characterized by a multi-component, multi-target, and multi-pathway approach in the treatment of diabetes mellitus (DM), thereby establishing a scientific foundation for further in-depth exploration of KDD’s molecular mechanisms.

Unraveling the Interplay of D-2-HG in Glioblastoma Tumorigenesis via Integrated Machine Learning and Molecular Docking Analysis.

Glioblastoma (GBM) is an exceptionally aggressive type of brain tumor with a poor prognosis, underscoring the urgent need to identify new molecular targets for therapeutic development. The objective of this research is to clarify the molecular interactions affected by the oncometabolite D-2-hydroxyglutarate (D-2-HG) within the framework of GBM. Differential expression analysis of multi-omics data identified potential target genes linked to GBM pathogenesis. To enhance our understanding of the binding interactions between D-2-HG and the identified target proteins, we utilized an integrated methodology encompassing various machine learning algorithms, network pharmacology techniques, and molecular docking. A total of 135 genes were identified as possible targets through which D-2-HG exerts its effects in GBM. The subsequent machine learning analysis identified six crucial genes [eukaryotic translation initiation factor 4E binding protein 1 (EIF4EBP1), fatty acid binding protein 3 (FABP3), potassium voltage-gated channel subfamily Q member 2 (KCNQ2), epithelial cell adhesion molecule (EPCAM), sphingosine-1-phosphate receptor 5 (S1PR5), and metabotropic glutamate receptor 3 (GRM3)] as key regulators. Among these, FABP3, KCNQ2, EPCAM, S1PR5, and GRM3 were significantly downregulated, whereas EIF4EBP1 was markedly upregulated (p < 0.05). Molecular docking simulations indicated a strong binding affinity of D-2-HG towards the target proteins. Our study suggests that D-2-HG plays a significant role in the pathogenesis of GBM by modulating specific genes and signaling pathways. Collectively, these results establish a substantial basis for future investigations into the mechanistic role of D-2-HG in GBM oncogenesis.


⚡ Quick Reads

Identification of novel umami peptides in fermented milk and elucidation of their umami mechanism via molecular docking and molecular dynamics simulations.

A streamlined workflow integrating multi-model machine learning, bioinformatics filtering, sensory evaluation, molecular docking and dynamics simulations was applied to mine umami peptides in fermented milk. Based on dual selection criteria, namely (i) unanimous umami prediction by UMPred-FRL, Umami_YYDS, Umami-MRNN, Mlp4Umami, and Umami_TD, and (ii) favorable in silico properties (non-toxicity, non-allergenicity, good solubility, stability, potential bioactivity), ten out of the 1505 peptides identified by peptidomics were shortlisted as umami peptide candidates. Sensory evaluation confirmed that eight imparted an umami taste. Molecular docking revealed that umami peptides interact with TAS1R1/TAS1R3 primarily through hydrogen bonds formed between their hydrophilic residues (predominantly Lys, Tyr) and receptor hydrophilic residues (notably Lys/Arg in TAS1R1, Asn in TAS1R3). Residues Arg307/Met375/Lys379 of TAS1R1, and Arg327/443/Ala329/Val437/Met452 of TAS1R3 were key interaction sites. Molecular dynamics simulations showed that the three peptides with the highest umami taste (EVFTKK, SKKTVDME, VMGVSKVKE) formed stable and compact complexes with TAS1R1/TAS1R3. This work enhances understanding of the umami characteristics of fermented milk.
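The dual selection criteria described above amount to a strict consensus filter, which can be sketched roughly as follows. The predictor names come from the abstract. The property flag names, the `shortlist` function, and the placeholder peptides with their flags are hypothetical and are not data from the study.

```python
from typing import Dict, List

# Predictor names as listed in the abstract.
PREDICTORS = ["UMPred-FRL", "Umami_YYDS", "Umami-MRNN", "Mlp4Umami", "Umami_TD"]
# Hypothetical flag names for the in silico property checks.
PROPERTY_CHECKS = ["non_toxic", "non_allergenic", "soluble", "stable", "bioactive"]


def shortlist(peptides: Dict[str, Dict[str, bool]]) -> List[str]:
    """Keep peptides that every predictor calls umami and that pass every property check."""
    required = PREDICTORS + PROPERTY_CHECKS
    # A missing flag counts as a failure, so incomplete records are never shortlisted.
    return [
        name
        for name, flags in peptides.items()
        if all(flags.get(key, False) for key in required)
    ]


# Placeholder records (made up for illustration only).
all_pass = {key: True for key in PREDICTORS + PROPERTY_CHECKS}
toy_peptides = {
    "PEP_1": dict(all_pass),
    "PEP_2": {**all_pass, "Umami_TD": False},  # one predictor disagrees
    "PEP_3": {**all_pass, "soluble": False},   # fails one property check
}
candidates = shortlist(toy_peptides)
print(candidates)
```

Requiring unanimity trades recall for precision, which suits a workflow where each shortlisted peptide still has to go through sensory evaluation.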

Identification of Three Novel Umami Peptides from Metagenomics of Traditional Fermented Fish, Suanyu, and Receptor Binding Mechanism via the Graph Neural Network-Based Model and Molecular Dynamics Simulation.

Fermented fish products are vital sources of umami peptides. In this study, a hierarchical graph attention network-based model was developed to identify candidate umami peptides. Via an integrated approach combining metagenomics, molecular docking, attention weight analysis, molecular dynamics simulations, and experimental validation, three novel umami peptides (GYSSYK, LYSDSK, and TRTKASY) were identified from the Suanyu system, a traditional fermented fish product. It was revealed that T1R1 and T1R3 could form stable complexes with these peptides involving critical residues: GLU301, ARG277, LYS328, SER384, ASP147, GLN278, and HIS71. In sensory evaluation, candidate peptides showed high umami properties with umami threshold values of 0.28 (±0.14) mg/mL. Overall, this study presents a hierarchical graph attention network-based screening methodology for the rapid screening and in-depth study of umami peptides.

Triazinyl-benzenesulfonamide derivatives as hCA IX inhibitors: Design, synthesis, and activity determination using an optimized stop-flow methodology.

This study presents the design, synthesis, and biological evaluation of a new series of 1,3,5-triazinyl benzenesulfonamide derivatives incorporating substituted piperazines, aminobenzenes, or adamantane moieties. The compounds were tested for inhibitory activity against human carbonic anhydrase isoenzymes II and IX, aiming for selectivity towards the latter, cancer-associated isoenzyme IX, on the basis of an initial molecular docking screening. Several compounds showed inhibitory activity and selectivity exceeding that of the clinical benchmark acetazolamide. Among the most effective compounds were derivatives 9 (K_I = 7.4 nM, selectivity ratio = 3.4) and 38 (K_I = 9.4 nM, selectivity ratio = 3.9). This activity trend was related to structural rigidity and hydrophobicity based on structure-activity analysis and molecular docking. The compounds' inhibition constants were determined using stopped-flow spectrophotometry in an updated approach, which enables accurate K_I determination with higher throughput. The methodology framework is described in detail to facilitate reproducibility in the field.

Influence of drying temperature on the metabolites profile and potential antioxidant pathways of Passiflora edulis peel: Integrating untargeted metabolomics with network pharmacology analyses, molecular docking, and molecular dynamics simulation.

Passiflora edulis peels possess considerable antioxidant potential, which is attributed to their diverse bioactive components. Nevertheless, these substances are susceptible to thermal degradation, which can diminish their usefulness and result in resource wastage. This study explores the influence of drying at different temperatures (room temperature (∼28 °C), 40 °C, and 70 °C) on the antioxidant properties and metabolite composition of P. edulis peel extracts. A comprehensive analytical approach was adopted, encompassing proximate analysis, vitamin C quantification, total phenolic and flavonoid determinations, free radical scavenging assays, metabolite profiling, network pharmacology, molecular docking, and molecular dynamics simulation. The content of crude fibre and of primary metabolites, including fat, protein and carbohydrate, was shown to be affected by increasing drying temperature. Likewise, the extract of P. edulis peels dried at room temperature showed significant antioxidant activity at 1 mg/mL, inhibiting 2,2-diphenyl-1-picrylhydrazyl radicals (DPPH•) by 81.20 % and 2,2’-azino-bis(3-ethylbenzothiazoline-6-sulfonic acid) radicals (ABTS⁺•) by 83.52 %. The content of secondary metabolites such as phenolics and flavonoids was also affected by temperature: peels dried at room temperature contained substantial phenolic and flavonoid contents of 23.71 ± 3.86 mg GAE/g and 35.43 ± 0.10 mg QE/g, respectively. Metabolite profiling via LC-MS QTOF showed that the room temperature extract contains 18 potential compounds, including oleamide, 6E,9E-octadecadienoic acid, C16 sphinganine, dodecanamide, and 2-hexyl-decanoic acid. SwissTargetPrediction was employed to identify hypothetical molecular targets, while oxidative stress-related targets were retrieved from the DrugBank, GeneCards, and DisGeNET databases. A component-target-pathway network was constructed, encompassing 12 bioactive compounds after initial ADMET screening and 10 hub genes, namely TP53, AKT1, CASP3, BCL2, STAT3, HSP90AA1, HSP90AB1, BCL2L1, ESR1, and MDM2. The identified potential antioxidant-related pathways included intrinsic apoptotic signalling, mitochondrial membrane organisation, and mitochondrial transport, among others. Structure-based virtual screening through molecular docking revealed that (S)-2-Hydroxy-2-phenylacetonitrile O-β-D-allopyranoside exhibited significant interaction with HSP90AB1, with a binding affinity of -8.4 kcal/mol. These findings reinforce the pharmacological relevance of P. edulis peels as a high-value reservoir of potential antioxidant substances suitable for the development of functional foods and drugs for disease prevention and health promotion.
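To put a docking score such as the -8.4 kcal/mol quoted above on a more familiar scale, it can be read as a binding free energy and converted to a dissociation constant with ΔG = RT ln(Kd). This is a rough sketch only: it assumes a 1 M standard state and 298.15 K, and docking scores are approximate estimates rather than measured free energies. The function name `kd_from_delta_g` is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g_kcal_per_mol: float, temperature_k: float = 298.15) -> float:
    """Return Kd in mol/L from ΔG = RT ln(Kd), assuming a 1 M standard state."""
    return math.exp(delta_g_kcal_per_mol / (R_KCAL * temperature_k))


kd = kd_from_delta_g(-8.4)
print(f"Kd is roughly {kd * 1e6:.2f} µM")
```

Under these assumptions the score corresponds to a Kd of about 0.7 µM, which is better treated as an order-of-magnitude guide than as a prediction.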

Ginsenoside Rb1 as a multi-target modulator in heart failure: Mechanistic insights into extracellular remodeling and transcriptional pathways from network pharmacology, molecular dynamics, and binding free energy analyses.

Background: Heart failure is a leading global health burden, often driven by Angiotensin II (Ang II)-induced processes such as inflammation, fibrosis, and extracellular matrix remodeling. These mechanisms involve multiple protein hubs, making single-target drugs insufficient. Natural products such as Ginsenoside Rb1, a major bioactive constituent of Panax ginseng, have emerged as promising multi-target agents, though their mechanistic roles in cardiovascular protection remain incompletely defined. Methods: A combined strategy of network pharmacology, protein-protein interaction analysis, molecular docking, molecular dynamics (MD) simulations, and MM/GBSA binding free energy calculations was employed. Hub proteins associated with Ang II-mediated heart failure were identified, followed by docking and MM/GBSA analyses to compare the binding affinity of Rb1 against reference drugs (Losartan, Enalapril, and Carvedilol). Protein-ligand interaction maps, hydrophobicity profiling, and electrostatic potential (ESP) analyses were used to elucidate binding mechanisms. Results: Five hub proteins (MMP9, FN1, JUN, FGF2, and STAT3) were identified as central to Ang II-driven remodeling, inflammation, and transcriptional regulation. MM/GBSA analyses revealed consistently favorable ΔG_bind values for Rb1, including -36.40 kcal/mol (FN1), -35.30 kcal/mol (STAT3), and -33.70 kcal/mol (JUN), which were comparable to or exceeded those of the reference drugs. In contrast, Rb1 showed moderate affinity at MMP9 (-31.80 kcal/mol) and FGF2 (-30.70 kcal/mol). Interaction plots demonstrated that the amphipathic nature of Rb1, with a bulky hydrophobic backbone and multiple polar hydroxyl groups, enabled multidentate hydrogen bonding, van der Waals stabilization, and π-alkyl interactions across diverse binding pockets. Hydrophobicity and ESP mapping further confirmed that Rb1 adapts effectively to both hydrophobic and polar microenvironments, explaining its broader multi-target binding capacity compared to the more structurally restricted reference drugs. Conclusion: This study highlights Ginsenoside Rb1 as a promising polypharmacological candidate for heart failure, showing strong and adaptable binding to multiple Ang II-related targets.

A Mini Review on Metal Complexes as Potential Anti-SARS-CoV-2 Agents: Insights from Molecular Docking Studies.

There is an urgent need to develop effective antiviral treatments against SARS-CoV-2. Despite the availability of vaccines, drug discovery remains critical for combating emerging variants. Molecular docking studies have become a vital computational tool for identifying antiviral drugs capable of inhibiting different SARS-CoV-2 proteins. This review explores the role of metal complexes as promising viral inhibitors through in silico molecular docking approaches. The binding abilities of several coordination complexes derived from iron, copper, palladium, and zinc ions have been evaluated against major viral proteins such as the spike glycoprotein, RNA-dependent RNA polymerase (RdRp), and the main protease (Mpro), which are responsible for viral infection. Comparative docking studies of specific metal-based compounds with conventional antiviral drugs highlight their superior binding affinities and inhibitory potential. Furthermore, ADME (Absorption, Distribution, Metabolism, and Excretion) analyses, molecular dynamics simulations, and drug-delivery strategies are discussed to assess pharmacokinetics and therapeutic viability. Overall, this review emphasizes the importance of molecular docking in the rational design of metal complexes as antiviral agents and its relevance for developing effective therapeutic strategies to combat COVID-19.

Comparative Analysis of AlphaFold2 Models and Intrinsic Disorder Illuminates Structural Divergence as a Symptom of Functional Divergence Across the Calmodulin Superfamily.

Protein structure enables function. Eukaryotic genomes contain paralogous genes often encoding functionally diverse proteins forming superfamilies. As protein sequences evolve, their function may change but identifying functional divergence from sequence alone is difficult. With AlphaFold2, large-scale evolutionary analyses of protein 3D structures to identify structural divergence as a symptom of functional divergence may be possible. We investigated the structural features of 448 proteins in the calmodulin superfamily that includes many functionally divergent paralogs with conformational heterogeneity. Phylogenetic reconstruction yielded 18 main clades. Across the phylogeny, most residues in the AlphaFold2 models were predicted with high model confidence. Further, conformationally flexible clades were more disordered based on IUPred2A prediction. Clustering based on pairwise similarity of structural properties including 3D structure, and secondary structure and disorder mapped to the alignment context revealed a similar agreement with the sequence-based phylogeny except for the clades with numerous recent gene duplications. Clustering based on model confidence was less similar to the sequence-based phylogeny. Notably, AlphaFold2 frequently modeled functionally similar proteins from the same main clade into highly similar structures while the models differ more between functionally divergent main clades and within clades with extensive gene duplications, which may yield rapidly diverging sequences with unexpected co-evolutionary patterns. These results suggest that by comparing the evolutionary signals from sequence, AlphaFold2 models, and disorder across protein families, we can expand our perspective on protein structure evolution including identifying functional divergence.

Mechanisms of Bellidifolin in Treating Doxorubicin-Induced Cardiotoxicity: Network Pharmacology, Molecular Docking, and Experimental Verification.

This study aims to examine the roles and mechanisms of action of bellidifolin (BEL) in alleviating doxorubicin-mediated cardiotoxicity using network pharmacology and experimental validation. Mice with doxorubicin-induced cardiotoxicity were randomly assigned to control, model, BEL, and dexrazoxane (DEX) groups. Echocardiography, histological staining, network pharmacology, and molecular validation were employed to assess cardiac function and myocardial injury. Immunohistochemical staining, western blotting, and RT-qPCR were used to confirm predicted targets and fibrosis biomarkers. In vivo experiments demonstrated that BEL significantly improved cardiac function, as indicated by enhanced Ejection Fraction (EF) and Fractional Shortening (FS) compared to the model group (p < 0.01). BEL also notably reduced myocardial injury markers, including creatine kinase MB isoenzyme (CK-MB) and lactate dehydrogenase (LDH) (p < 0.01), and alleviated doxorubicin-induced myocardial fibrosis. Network pharmacology identified 61 common target genes for BEL and cardiotoxicity. Protein-protein interaction (PPI) network analysis highlighted 16 core genes, including transforming growth factor (TGF)-β1. Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) enrichment analyses revealed that BEL’s action pathways were primarily linked to the PI3K-AKT signaling pathway. Molecular docking and molecular dynamics simulations showed a strong binding affinity between BEL and the core target TGF-β1. In vivo validation confirmed that BEL significantly downregulated the expression of TGF-β1, α-smooth muscle actin (SMA), collagen I (Col I), and collagen III (Col III) in myocardial tissue (p < 0.01 or p < 0.05), while activating the PI3K-AKT signaling pathway (p < 0.01 or p < 0.05). BEL is a promising therapeutic candidate for cardiotoxicity, likely through its anti-fibrotic effects via the reduction of TGF-β1, α-SMA, Col I, and Col III expression, alongside regulation of the PI3K-AKT signaling pathway.

💡 Pipeline Tip

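Convert bedGraph or wig tracks to BigWig before visualization. BigWig is an indexed binary format, so a genome browser reads only the region in view instead of loading the whole file into memory.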


🛠️ Resources

The protein structure is the language of life; design is its poetry. — Recep Adiyaman
