ProtFunAgent: Agentic LLM Cascades for Low-Resource Protein Function Gap-Filling via Homology RAG and Ontology-Constrained Decoding
Sajib Acharjee Dip · John Choy · Liqing Zhang
Abstract
Predicting protein function is a long-standing challenge, especially for poorly characterized sequences where homology transfer is unreliable and large language models (LLMs) produce fluent but biologically imprecise annotations. Existing approaches often fail to integrate critical priors such as Gene Ontology (GO) structure or homology evidence, limiting both recall and generalization. We present \textbf{ProtFunAgent}, an agentic framework that couples LLM reasoning with biological constraints through three key innovations: (1) \emph{homology-guided retrieval-augmented generation}, where top-$k$ sequence homologs inject functional priors; (2) \emph{ontology-constrained decoding}, aligning predictions with the GO hierarchy via lexicon-aware filtering and pruning; and (3) a \emph{synthesis-and-judging cascade} of LLMs, where multiple models collaborate and self-evaluate to refine candidate summaries. This design mirrors biocurator workflows while retaining the flexibility of generative models. On UniProt-derived benchmarks, ProtFunAgent outperforms single-LLM and heuristic baselines, delivering \textbf{over $3\times$ higher hierarchical F1} and nearly doubling recall while maintaining precision. Moreover, the framework \textbf{closes more than half of the gap to oracle-level annotation}, demonstrating that embedding biological structure into agentic LLM pipelines enables scalable, ontology-faithful function prediction. ProtFunAgent provides a general blueprint for marrying symbolic constraints with generative reasoning, advancing automated protein annotation at scale.
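To make the three stages concrete, below is a minimal, self-contained Python sketch of the pipeline the abstract describes: homology retrieval supplies candidate GO terms, an ontology-constrained step prunes and ancestor-closes them against a toy GO DAG, and a synthesis-and-judging cascade is stubbed out as plain functions. Everything here is an illustrative assumption — the toy data (TOY_HOMOLOG_DB, TOY_GO_DAG), the crude identity score, and the placeholder synthesize_summary/judge_summary calls are not the authors' implementation or any real LLM API.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# All data structures and helper names (TOY_HOMOLOG_DB, TOY_GO_DAG,
# synthesize_summary, judge_summary) are illustrative assumptions,
# not the authors' code or a real retrieval/LLM backend.

from typing import Dict, List, Set, Tuple

# Toy homolog database: protein id -> (sequence, annotated GO terms).
TOY_HOMOLOG_DB: Dict[str, Tuple[str, Set[str]]] = {
    "P001": ("MKTAYIAKQR", {"GO:0003824", "GO:0016787"}),
    "P002": ("MKTAYIAKQL", {"GO:0016787"}),
    "P003": ("MSSSSWLLLS", {"GO:0005515"}),
}

# Toy GO DAG: child term -> parent terms (is_a edges only).
TOY_GO_DAG: Dict[str, Set[str]] = {
    "GO:0016787": {"GO:0003824"},   # hydrolase activity is_a catalytic activity
    "GO:0003824": set(),            # catalytic activity
    "GO:0005515": set(),            # protein binding
}


def retrieve_homologs(query: str, k: int = 2) -> List[str]:
    """Stage 1 (homology RAG): rank entries by a crude identity score and
    return the top-k ids. A real system would use BLAST/MMseqs2 instead."""
    def identity(a: str, b: str) -> float:
        n = min(len(a), len(b))
        return sum(x == y for x, y in zip(a, b)) / max(n, 1)
    ranked = sorted(TOY_HOMOLOG_DB,
                    key=lambda pid: identity(query, TOY_HOMOLOG_DB[pid][0]),
                    reverse=True)
    return ranked[:k]


def ontology_constrained_terms(candidates: Set[str]) -> Set[str]:
    """Stage 2 (ontology-constrained decoding, simplified): keep only terms in
    the GO lexicon and close the set under ancestor propagation so predictions
    respect the hierarchy (true-path rule)."""
    valid = {t for t in candidates if t in TOY_GO_DAG}
    closed, frontier = set(valid), list(valid)
    while frontier:
        term = frontier.pop()
        for parent in TOY_GO_DAG.get(term, set()):
            if parent not in closed:
                closed.add(parent)
                frontier.append(parent)
    return closed


def synthesize_summary(query: str, homologs: List[str], terms: Set[str]) -> str:
    """Stage 3a (synthesis): placeholder for an LLM call that drafts a summary
    conditioned on homolog evidence and the allowed GO terms."""
    return (f"Putative function of {query[:6]}...: supported terms "
            f"{sorted(terms)} (evidence: {homologs})")


def judge_summary(summary: str, allowed: Set[str]) -> bool:
    """Stage 3b (judging): placeholder for a second LLM that checks the draft
    cites only terms inside the ontology-constrained set."""
    cited = {tok.strip("[]',") for tok in summary.split()
             if tok.strip("[]',").startswith("GO:")}
    return cited <= allowed


if __name__ == "__main__":
    query_seq = "MKTAYIAKQN"
    homologs = retrieve_homologs(query_seq, k=2)
    raw_terms = set().union(*(TOY_HOMOLOG_DB[h][1] for h in homologs))
    allowed = ontology_constrained_terms(raw_terms)
    draft = synthesize_summary(query_seq, homologs, allowed)
    print(draft, "| accepted:", judge_summary(draft, allowed))
```

In the full framework, the stubbed synthesis and judging steps would be separate LLM calls in a cascade, with the judge rejecting or revising drafts that cite terms outside the ontology-constrained candidate set.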