

Automated dataset extraction from SEC filings

Rohit Dube · Rohit Khandekar · Muhammad Ishaq


Automated extraction and analysis of key information from unstructured documents is a central problem in information retrieval. Businesses are often inundated with large volumes of documents such as financial statements, contracts and agreements, invoices, and customer lists. These documents are generally written for human comprehension and consumption, which makes automating their processing non-trivial.

Currently, information from such documents is extracted through some combination of manual work and proprietary scripts that often break when something changes, leading to low efficiency, high labor cost, and inconsistent output. Investment banks, fund managers, marketing agencies, and investors spend millions to either buy the data or outsource the whole process, even though the data is publicly available for free. We describe a capability for automated extraction and real-time analysis of datasets from a large corpus of documents containing running text and tables. The current version of our product works with millions of HTML documents from Securities and Exchange Commission (SEC) filings. These filings contain mandatory disclosures from US corporations, such as financial information, executive compensation, mergers and acquisitions, and key management changes.

Our algorithm extracts information from millions of documents, normalizes it, and stores it in an efficient, queryable format. It then interprets input queries and looks up the relevant documents to compose an answer.
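The abstract does not describe how these stages are implemented. As a rough illustration only, the sketch below walks one tiny HTML table through the same four stages: extract, normalize, store, and query. Everything in it is an assumption made for illustration, not the authors' method: the names `TableRowParser`, `normalize_amount`, `ingest`, and `answer`, the `facts` table schema, the sample HTML, and the naive label-matching used to interpret a query. It uses only the Python standard library.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical extractor: collects each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def normalize_amount(text):
    """Turn '$1,234' or '(56)' into a float; return None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Assumed convention: parentheses mark a negative amount.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()")
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


def ingest(conn, doc_id, html):
    """Extract (label, amount) rows from one document and store them."""
    parser = TableRowParser()
    parser.feed(html)
    for row in parser.rows:
        if len(row) < 2 or not row[0]:
            continue
        value = normalize_amount(row[1])
        if value is not None:  # skips header rows and non-numeric cells
            conn.execute(
                "INSERT INTO facts VALUES (?, ?, ?)",
                (doc_id, row[0].lower(), value),
            )


def answer(conn, query):
    """Naive query interpretation: return facts whose label occurs in the query."""
    q = query.lower()
    rows = conn.execute("SELECT doc_id, label, value FROM facts").fetchall()
    return [(d, label, v) for d, label, v in rows if label in q]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (doc_id TEXT, label TEXT, value REAL)")

# Made-up sample input, not taken from any real filing.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Item</th><th>Amount</th></tr>"
    "<tr><td>Total revenue</td><td>$1,234</td></tr>"
    "<tr><td>Net loss</td><td>(56)</td></tr>"
    "</table>"
)
ingest(conn, "sample-filing", SAMPLE_HTML)
print(answer(conn, "What was the total revenue?"))
```

A system operating on millions of filings would need far more than this, for example handling of nested and multi-column tables, running text, and a real query interpreter. The sketch is only meant to make the flow of data between the stages concrete.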
