3DGrid-LLM: Token-Level Fusion of Language and 3D Grids for Chemical Multimodal Generation
Abstract
We introduce 3DGrid-LLM, an early-fusion multimodal foundation model that integrates natural language with 3D density grids for molecular and materials science. The model extends a large decoder-only language model with discrete volumetric tokens produced by a 3D VQGAN, enabling unified token-level processing of spatial and textual information. Trained on diverse molecular and materials datasets, 3DGrid-LLM supports bidirectional text-to-grid and grid-to-text generation, multimodal question answering, and retrieval-augmented 3D grid generation. Experiments show consistent improvements over baselines on multimodal question answering, semantic text generation, and property-aligned retrieval, yielding accurate and physically consistent outputs. This work establishes a scalable framework for incorporating physically grounded volumetric data into language models.
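To make the token-level fusion concrete, the sketch below illustrates one plausible reading of the abstract: a 3D density grid is encoded and vector-quantized into discrete codebook indices, which are then offset into an extended vocabulary and concatenated with text tokens so a decoder-only language model can process both modalities as one sequence. This is a minimal toy sketch, not the paper's implementation; every name and size here (ToyGridQuantizer, TEXT_VOCAB, GRID_CODEBOOK, GRID_OFFSET, the conv encoder) is an illustrative assumption.

```python
# Toy sketch of token-level fusion of text and 3D-grid tokens.
# All module names, vocabulary sizes, and the token-ID offset scheme
# are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000       # assumed text vocabulary size
GRID_CODEBOOK = 8_192     # assumed 3D-VQGAN codebook size
GRID_OFFSET = TEXT_VOCAB  # grid codes are mapped into an extended vocabulary


class ToyGridQuantizer(nn.Module):
    """Stand-in for a 3D-VQGAN encoder: downsamples a density grid and
    snaps each latent voxel to its nearest codebook entry."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv3d(1, latent_dim, kernel_size=4, stride=4)
        self.codebook = nn.Embedding(GRID_CODEBOOK, latent_dim)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (B, 1, D, H, W) -> latent volume (B, C, d, h, w)
        z = self.encoder(grid)
        # flatten the latent voxels into a sequence: (B, d*h*w, C)
        z = z.flatten(2).transpose(1, 2)
        # nearest-neighbor lookup against the codebook: (B, n, K) distances
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)  # discrete grid token IDs, (B, n)


def fuse_tokens(text_ids: torch.Tensor, grid_ids: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and offset grid tokens into one sequence
    over the extended vocabulary for a decoder-only LM."""
    return torch.cat([text_ids, grid_ids + GRID_OFFSET], dim=1)


if __name__ == "__main__":
    grid = torch.randn(1, 1, 32, 32, 32)       # toy density grid
    grid_ids = ToyGridQuantizer()(grid)         # (1, 512) discrete codes
    text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
    fused = fuse_tokens(text_ids, grid_ids)
    print(fused.shape)                          # torch.Size([1, 528])
```

Under this reading, the language model needs no architectural change beyond enlarging its embedding table and output head to TEXT_VOCAB + GRID_CODEBOOK entries, which is what makes the fusion "early" and purely token-level.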