Issue #86: Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

Subscribe to Protein Design Digest

Daily curated signals from arXiv, PubMed, and BioRxiv.

Signal of the Day

Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok’s (2026) linear scaling law of the expressibility gap with $R^2$ = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, $ρ$ = -0.29) before crystallizing into alignment at the final layer ($ρ$ = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ($λ$ = 0.15-0.6), achieving +28% median margin improvement at $λ$ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at $λ$ = 0.6), with content and entity-like contributions shrinking at higher $λ$. Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit.
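One way to make the "token-decision margin" concrete is the gap between the two largest logits at a position: a gap of zero means the hidden state sits on a boundary between two Voronoi cells. The sketch below is only an illustration under that assumed definition, not the paper's procedure. The function name `top2_margins` and the toy vectors are hypothetical, and plain Python floats (double precision) are used, so the float32 versus bfloat16 comparison from the abstract is not reproduced.

```python
def top2_margins(hidden, unembed):
    """Return the top-1 minus top-2 logit gap for each position.

    hidden:  list of hidden-state vectors, one per position (toy data here)
    unembed: list of token embedding vectors, one per vocabulary entry
    """
    margins = []
    for h in hidden:
        # Logit of each token is the dot product with the hidden state.
        logits = [sum(a * b for a, b in zip(h, w)) for w in unembed]
        best, second = sorted(logits, reverse=True)[:2]
        margins.append(best - second)
    return margins


# Hypothetical 2-dimensional example with a 3-token vocabulary.
hidden = [[1.0, 0.0], [0.5, 0.5]]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
margins = top2_margins(hidden, unembed)
print(margins)
```

In this toy case the first position lies well inside a cell, while the second is equidistant from two tokens and has zero margin.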

Also Worth Reading

Evaluating zero-shot prediction of monomeric protein design success by AlphaFold, ESMFold, and ProteinMPNN.

De novo protein design has enabled the creation of proteins with diverse functionalities that are not found in nature. Despite recent advances, experimental success rates remain inconsistent and context-dependent, posing a bottleneck for broader applications of de novo design. To overcome this, structure and sequence prediction models have been applied to assess design quality prior to experimental testing to save time and resources. In this study, we examined the extent to which AlphaFold, ProteinMPNN, and ESMFold can discriminate between experimentally successful and unsuccessful designs. We first curated a benchmark dataset of 614 experimentally characterized de novo designed monomers from 11 different design studies between 2012 and 2021. All predictive models demonstrated moderate ability to discriminate experimental successes (expressed, soluble, monomeric, and folded with the correct secondary structure) from failures. Still, many failed designs had better confidence metrics than successful designs, and confidence metrics were topology-dependent. Among all computational models evaluated, the ESMFold average predicted local-distance difference test (pLDDT) score yielded the best individual performance at distinguishing between successful and unsuccessful designs. A logistic regression model combining all confidence metrics provided only modest improvement over ESMFold pLDDT alone. Overall, these results show that these models can serve as an initial filtering strategy prior to experimental validation; however, their utility at accurately predicting experimentally successful designs remains limited without task-specific training.
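How well a single confidence metric separates successes from failures is often summarized with the area under the ROC curve (AUROC). The abstract does not say which statistic the authors used, so treat AUROC here as an assumption. The sketch below computes it by pairwise comparison; the pLDDT values and success labels are invented for illustration and are not taken from the benchmark.

```python
def auroc(scores, labels):
    """Probability that a random success outscores a random failure.

    Ties count as half a win. labels use 1 for success and 0 for failure.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical mean pLDDT per design and experimental outcome.
plddt = [92.0, 85.0, 88.0, 70.0, 90.0, 60.0]
success = [1, 1, 0, 0, 1, 0]
score = auroc(plddt, success)
print(round(score, 3))
```

The failed design scoring 88 outranks the successful design scoring 85, which is the kind of overlap the abstract describes and the reason the value falls short of 1.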

Comprehensive Molecular Docking and Molecular Dynamics Reveal Inhibitors of HER2 L755S, T798I, and T798M based on a Large Database of Curcumin Derivatives.

Objective: This study presents a methodology employing virtual screening to identify curcumin derivatives with selective affinity for the HER2 mutations L755S, T798I, and T798M. Methods: Curcumin derivatives were retrieved from the ChEMBL database and filtered using KNIME. HER2 mutations were modeled in silico using MOE software with PDB ID 3RCD. Molecular docking and dynamics simulations were conducted to screen high-affinity compounds and evaluate binding interactions. Results: From 505 curcumin derivatives, the RDKit module implemented in KNIME successfully filtered 317 compounds. Subsequent molecular docking against wild-type HER2 identified 100 curcumin derivatives with low docking scores, among which the top 20 compounds exhibited better binding affinities than Lapatinib. Further molecular docking screening against the three HER2 mutations identified five lead compounds with the lowest docking scores. Molecular docking and molecular dynamics simulation revealed critical binding interactions with residues essential for kinase domain stability. Chemical structural analysis revealed key modifications, such as geranyl and tripeptide modifications. CHEMBL3758656 and CHEMBL3827366, two curcumin derivatives, demonstrated consistent binding across HER2 mutations and a favorable ADMET profile. Conclusion: This study successfully identified CHEMBL3758656 and CHEMBL3827366 as promising HER2 inhibitors through comprehensive virtual screening. Their high binding affinity against L755S, T798I, and T798M mutations and favorable ADME and toxicity properties underscore their potential as alternative therapeutics for HER2-positive breast cancer.

Deep learning on protein language model embeddings unlocks accurate prediction of protein solubility.

Protein misfolding is a major limitation in prokaryotic expression systems, which lack post-translational modifications and exhibit distinct intracellular environments. This severely hinders the functional expression of many heterologous proteins, especially in Escherichia coli. Accurate prediction of protein solubility is crucial for synthetic biology and protein engineering but remains a challenging task. Here, we present DeepSolNet, a deep learning model that leverages advanced protein language models to enhance solubility prediction. DeepSolNet adopts a multi-module architecture, integrating contextual embeddings from ESM Cambrian with bidirectional long short-term memory networks, convolutional neural networks, and attention mechanisms. On the validation set, DeepSolNet achieved an accuracy of 0.75 and a Matthews correlation coefficient of 0.50 for soluble/insoluble protein classification. On an independently constructed test set containing gammabody, transglutaminase, and aldehyde dehydrogenase sequences, the model maintained high performance with an accuracy of 0.53, achieving state-of-the-art performance. Visualization analyses further showed that DeepSolNet is sensitive to key residues influencing protein solubility. These results demonstrate that DeepSolNet serves as a powerful and generalizable tool for large-scale protein design and expression optimization. The tool is freely available at https://github.com/wangxinglong1990/DeepSolNet.
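For readers less familiar with the Matthews correlation coefficient (MCC), the sketch below computes it alongside accuracy from a binary confusion matrix. The counts are hypothetical and chosen only so that a balanced 200-protein set reproduces the validation figures quoted above (accuracy 0.75, MCC 0.50); the actual confusion matrix is not given in the abstract.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical balanced set: 100 soluble and 100 insoluble proteins.
counts = dict(tp=75, tn=75, fp=25, fn=25)
acc = accuracy(**counts)
coef = mcc(**counts)
print(acc, coef)
```

MCC runs from -1 to 1 with 0 meaning chance-level agreement, so on a balanced set it is a stricter reading of the same predictions than accuracy alone.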

Research & AI Updates

OpenFold3 Meets AMD Instinct™ GPUs: Unlocking Scalable, High-Throughput Structural Biology - AMD
NVIDIA Scales AlphaFold-Multimer for Proteome-Wide Protein Complex Prediction - MEXC
How to Accelerate Protein Structure Prediction at Proteome-Scale | NVIDIA Technical Blog - NVIDIA Developer
Demis Hassabis: AI Competition Gap to Widen - 36 Kr
Congratulations Dr. Mariella Quispe-Carbajal and Dr. Lingshuang Wu on their Successful Thesis Defenses! - Stony Brook University

From the Industry

LangChain Unveils Human-AI Feedback Loop Framework for Trading Copilots - blockchain.news
IPO Tracker 2026: Avalyn plots IPO to push inhaled pulmonary fibrosis pipeline through clinic - BioSpace
Companies partner to accelerate development of sugar reduction solutions - Food Business News
Avalyn plans IPO to fund ph. 3 trials of inhaled lung drugs - Fierce Biotech
KAIST Researchers Develop AI Protein Design Technology with Nobel Laureate - Seoul Economic Daily
Alloy in multi-target collaboration and license deal with Biogen - The Pharma Letter

Quick Reads

Exploring quantum annealing for coarse-grained protein folding.

We explore the potential application of quantum annealing to address the protein structure problem.

ME-PFP: An Ensemble Learning Approach Fusing Multi-Source Features for Protein Function Prediction.

Proteins, as essential components of living organisms, play a critical role in both drug discovery and disease mechanism research.

How artificial intelligence is reengineering protein engineering.

Over the past decades, protein engineering has matured into a field of its own, driven by computational modeling and high-throughput wet lab experiments, with broad application in therapeutics, diagnostics, agriculture, and manufacturing.

Identification of ursolic acid from Wumei as a syk-targeting anti-allergic agent using a piezoresistive cantilever biosensor.

Wumei (WM), a fruit long used in China as both food and medicine, is reported to have anti-allergic effects, yet its active components and mechanisms remain unclear.

Aromatherapy with Chrysanthemum morifolium cv. Chuju essential oil alleviates allergic rhinitis by modulating the mTOR-PPARγ signaling cascade.

Conventional treatments for allergic rhinitis (AR), such as oral medications and nasal sprays, can effectively alleviate symptoms but often cause side effects, including potential organ damage and symptom relapse after discontinuation.

In silico and in vitro analysis: Unveiling the therapeutic potential of flavonoids against KLF7 in ovarian cancer.

Introduction: Ovarian cancer (OC) remains a major clinical challenge due to late-stage diagnosis, molecular heterogeneity, and the frequent development of chemoresistance, leading to poor patient outcomes.

BioMutation: a portable graphical user interface for mutagenesis and feature analysis in proteins, nucleic acids, and their complexes.

Protein and nucleic acid mutational studies are central to understanding biomolecular structure, function, and interactions, yet existing computational tools often lack user-friendly interfaces for high-throughput and systematic mutagenesis.

Pipeline Tip

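Convert bedGraph or wig tracks to BigWig before visualization to save memory: BigWig is an indexed binary format, so a genome browser loads only the region in view.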

Resources & Tools

Dataset: BioLiP - Verified biologically relevant ligand-protein interactions.
Dataset: SIFTS - Residue-level mapping between PDB, UniProt, and other resources.
Tool: MMseqs2 - Fast and sensitive sequence search and clustering suite.
Tool: HHSuite - Remote homology detection with HMM-HMM comparison.
Event: Protein Design Hub (LinkedIn Group) (Ongoing)
Event: Structural Biology Events (Open)
Job: Adverum Biotechnologies, Inc. - Associate Director of External QA (Finished Goods) (Contract) - jobs.lever.co
Job: Arcadia Science - Platform Scientist: Protein Evolution - jobs.lever.co

The protein structure is the language of life; design is its poetry. — Recep Adiyaman
