A scalable platform to build the data layer of knowledge graph AI
Abstract
Knowledge graphs (KGs) underpin modern graph AI, from retrieval-augmented generation to large graph-language models. However, pipelines for constructing and maintaining KGs remain difficult to reproduce and to scale. We introduce Optimus, an opinionated platform for building large-scale KGs with an emphasis on reproducibility and extensibility. Optimus adopts a data lake-inspired medallion architecture, enforces schema contracts and identifier harmonization, and produces machine learning-ready KG exports. In benchmarking experiments, Optimus constructed a biomedical KG with 192,307 nodes, 21.5M edges, and 88.6M properties from 47 heterogeneous datasets. Parallelized execution reduced wall-clock build time by 56.5% relative to sequential execution (62.4 s vs. 143.6 s), while per-edge throughput improved as the graph scaled. These results demonstrate that Optimus enables efficient, reproducible, and scalable KG construction, strengthening the data layer of knowledge-grounded AI.
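As a quick check on the reported timing result, the relative reduction and the implied speedup can be recomputed from the two build times given above. The snippet below is only an illustrative sketch: the function name `relative_reduction` and the variable names are assumptions introduced here and are not part of Optimus.

```python
def relative_reduction(baseline_s: float, optimized_s: float) -> float:
    """Fraction of the baseline time saved by the optimized run."""
    return (baseline_s - optimized_s) / baseline_s


# Build times reported in the abstract (seconds).
sequential_s = 143.6  # sequential execution
parallel_s = 62.4     # parallelized execution

reduction = relative_reduction(sequential_s, parallel_s)
speedup = sequential_s / parallel_s  # how many times faster the parallel build is

print(f"reduction: {reduction:.1%}")  # reduction: 56.5%
print(f"speedup:   {speedup:.2f}x")   # speedup:   2.30x
```

The 56.5% reduction therefore corresponds to a speedup of roughly 2.3x over the sequential build.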