Exploratory Parsing

Recent advances in parsing combine context and backtracking into a single language. We use both attributes to explore semi-structured texts by supporting parser generation with experiment management and continuous visualization of partial results.

Datasets

We describe various server-resident datasets. For each we provide the record framing and information coding conventions as we know them and suggest approaches to extracting additional features.

Wikipedia

3,500,000 Articles from English Wikipedia dump. ★

47,200,000 Surveys for Wikipedia article quality.

AboutUs

18,000,000 AboutUs domain pages.

30,000 Web Pages from .com zone file scrape. ★

3,000 Thick whois records. private

Miscellaneous

19,000 Sentences by Dickens.

43,000 Batch job steps. private

Documentation

We describe the exploratory approach we take to parsing existing corpora. blog

We use an enhanced version of Ian Piumarta's peg/leg parser. website github

We've modeled our experimental workflow on lessons from our original web-based experiment manager. github

Bryan Ford describes Parsing Expression Grammars in his 2004 POPL paper. pdf
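
As a rough illustration of what a Parsing Expression Grammar does, here is a minimal sketch in Python of ordered choice with backtracking. It is not the peg/leg implementation and does not use its grammar syntax; the helper names (literal, sequence, choice) and the sample grammar are hypothetical, chosen only for this example.

```python
from typing import Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position),
# or None when it fails to match at that position.
Result = Optional[Tuple[object, int]]
Parser = Callable[[str, int], Result]


def literal(s: str) -> Parser:
    """Match the exact string s at the current position."""
    def parse(text: str, pos: int) -> Result:
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse


def sequence(*parsers: Parser) -> Parser:
    """Match every parser in order; fail as a whole if any one fails."""
    def parse(text: str, pos: int) -> Result:
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None  # nothing is consumed; the caller keeps its own position
            value, pos = r
            values.append(value)
        return values, pos
    return parse


def choice(*parsers: Parser) -> Parser:
    """Ordered choice: try alternatives left to right, first success wins."""
    def parse(text: str, pos: int) -> Result:
        for p in parsers:
            r = p(text, pos)  # every alternative restarts at the same pos (backtracking)
            if r is not None:
                return r
        return None
    return parse


# Hypothetical grammar: greeting <- "hello" " world" / "hello" " there"
greeting = choice(
    sequence(literal("hello"), literal(" world")),
    sequence(literal("hello"), literal(" there")),
)

# The first alternative matches "hello", fails on " world", and the
# parser backtracks to position 0 before trying the second alternative.
result = greeting("hello there", 0)

# Ordered choice is prioritized: the first matching alternative is taken
# even when a later one would consume more input.
prioritized = choice(literal("a"), literal("ab"))("ab", 0)
```

Because each alternative is retried from the saved position, a failed branch leaves no trace in the result. This is the property that makes it cheap to try speculative rules against semi-structured text.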