Expo Talk Panel
Data Scout: “From Prompt to Corpus: Accelerating Domain-Specific Data Collection with LLMGuided Discovery”
Chirag Garg
Exhibit Hall G,H
Training large language models for specialized disciplines such as advanced mathematics, molecular biology, or legal reasoning is limited by the scarcity of large, high quality, domain specific corpora. Most publicly available datasets are dominated by general purpose web text. When available, specialized data are fragmented across diverse sources such as preprints, conference papers, forums, lecture notes, and digitized books. No single source offers comprehensive real-world coverage across scientific domains. Consequently, scaling up authentic domain data remains a bottleneck: collecting a subset of relevant tokens often requires downloading and filtering hundreds of terabytes of raw web material, a process that is both time consuming and costly. x000D x000D We introduce Data Scout, a modular, LLM powered pipeline that turns a high-level user intent (e.g., “I need data for advanced mathematics”) into a vetted, list of seed URLs in minutes. The system first expands the original intent using an LLM that generates a hierarchical subgraph of related concepts; this taxonomy drives a diversified set of search queries that systematically cover the target domain while respecting known licensing signals. Candidate URLs are then filtered by the same LLM using chain of thought prompting based on topical relevance, licensing clarity, and crawlability. Our results show that that the list of selected candidate URLs when crawled can yield a high percentage of relevant pages (40%+) related to the user’s intended topic or query, compared to less than 1 percent in general web-scale corpora. Data Scout is available with both CLI and GUI front ends. By democratizing domain specific data acquisition, Data Scout enables researchers without dedicated crawling infrastructure to bootstrap large, high-fidelity corpora, accelerating the development of specialized LLMs across various niche domains fields.
Live content is unavailable. Log in and register to view live content