From Evidence to Knowledge: A Hierarchical Probabilistic Model of the Scientific Knowledge Landscape at Web Scale
Abstract
Scientific literature contains essential but often fragmented and conflicting evidence, a long-standing challenge brought into focus by the emergence of Large Language Models (LLMs) that can read and extract information at web scale. Traditional methods for knowledge integration rely on knowledge graphs that treat extracted statements as deterministic facts. They impose rigid assumptions, such as the closed-world assumption and the independence of relationships, and therefore fail to capture uncertainty or reconcile contradictions. We introduce a shift from deterministic fact aggregation to a probabilistic framework that models article-level evidence as noisy, partial observations of a latent hierarchical structure. Applied to a biomedical corpus, our method synthesizes article-level evidence into stable and biologically coherent clusters, indicating that reliable signals can be extracted even when inputs are sparse, biased, or unreliable.