Omni-DNA: A Genomic Model Supporting Sequence Understanding, Long-context, and Textual Annotation
Abstract
The interpretation of genomic sequences is crucial for understanding biological processes. To handle the growing volume of DNA sequence data, Genomic Foundation Models (GFMs) have been developed by adapting architectures and training paradigms from Large Language Models (LLMs). Despite their remarkable performance on DNA sequence classification tasks, there remains a lack of systematic understanding of the training and task-adaptation processes of GFMs. Moreover, existing GFMs cannot achieve state-of-the-art performance on both short- and long-context tasks and lack multimodal capabilities. By revisiting pre-training architectures and post-training techniques, we propose Omni-DNA, a family of models spanning 20M to 1.1B parameters that supports sequence understanding, long-context genomic reasoning, and natural-language annotation. Omni-DNA establishes new state-of-the-art results on 18 of 26 evaluations drawn from Nucleotide Transformer and Genomic Benchmarks. When jointly fine-tuned on biologically related tasks, Omni-DNA consistently outperforms existing models and demonstrates multi-tasking ability. To enable processing of arbitrary sequence lengths, we introduce SEQPACK, an adaptive compression operator that packs historical tokens into a learned synopsis via a position-aware learnable sampling mechanism, enabling transformer-based models to process ultra-long sequences with minimal memory and computational requirements. Our approach demonstrates superior performance on enhancer-target interaction tasks, capturing distant regulatory interactions at the 450 kbp range more effectively than existing models. Finally, we present a new dataset, termed seq2func, which enables Omni-DNA to generate accurate and functionally meaningful interpretations of DNA sequences, unlocking new possibilities for genomic analysis and discovery.
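The abstract only names SEQPACK's mechanism, so the following is a purely illustrative sketch and not the authors' implementation: it shows one plausible reading of "position-aware learnable sampling", in which a fixed set of learned query vectors cross-attends over position-encoded hidden states of the token history and returns a short synopsis. The class name SeqPackSketch, the synopsis size n_synopsis, and the sinusoidal positional scheme are all assumptions made for this sketch.

```python
# Illustrative sketch only; every design choice here is an assumption,
# not the paper's published SEQPACK architecture.
import torch
import torch.nn as nn


def sinusoidal_positions(t: int, d: int, device) -> torch.Tensor:
    # Standard sinusoidal encoding; chosen here to avoid a large learned
    # position table (a real system may use a different scheme).
    pos = torch.arange(t, device=device, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, device=device, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(t, d, device=device)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


class SeqPackSketch(nn.Module):
    """Compress an arbitrarily long token history into a fixed-size synopsis."""

    def __init__(self, d_model: int = 512, n_synopsis: int = 64, n_heads: int = 8):
        super().__init__()
        # Learned "sampling" queries, one per synopsis slot (hypothetical design).
        self.queries = nn.Parameter(torch.randn(n_synopsis, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, d_model) hidden states of tokens that fall
        # outside the model's current attention window.
        b, t, d = history.shape
        keys = history + sinusoidal_positions(t, d, history.device)  # position-aware
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        synopsis, _ = self.attn(q, keys, keys)  # (batch, n_synopsis, d_model)
        return synopsis  # prepended to the current window in place of raw history


if __name__ == "__main__":
    pack = SeqPackSketch()
    long_history = torch.randn(2, 8192, 512)  # stand-in for a long DNA context
    print(pack(long_history).shape)           # torch.Size([2, 64, 512])
```

Under this reading, memory and compute for the history collapse from O(seq_len) to O(n_synopsis) per step, which is consistent with the abstract's claim of processing ultra-long sequences at minimal cost; the actual SEQPACK operator may differ.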