Recent advances in parsing combine context and backtracking into a single language. We use both attributes to explore semi-structured texts by supporting parser generation with experiment management and continuous visualization of partial results.
See previous Exploratory Parsing Webapp.
Datasets
We describe various server-resident datasets. For each, we provide the record framing and information coding conventions as we know them, and suggest approaches to extracting additional features.
Wikipedia
3,500,000 Articles from English Wikipedia dump. ★
47,200,000 Surveys for Wikipedia article quality.
AboutUs
18,000,000 AboutUs domain pages.
30,000 Web Pages from .com zone file scrape. ★
3,000 Thick whois records. private
Miscellaneous
19,000 Sentences by Dickens.
43,000 Batch job steps. private
Documentation
We describe the exploratory approach we take to parsing existing corpora. blog
We've modeled the experimental workflow on lessons learned from our original web-based experiment manager. github
Bryan Ford describes Parsing Expression Grammars in his 2004 POPL paper. pdf
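As a rough illustration of the prioritized choice and backtracking that Parsing Expression Grammars rely on, here is a minimal sketch in Python. The combinator names (literal, sequence, choice) and the toy grammar are assumptions made for this example; they are not taken from the webapp or from Ford's paper.

```python
from typing import Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position),
# or None when it fails to match at that position.
Result = Optional[Tuple[object, int]]
Parser = Callable[[str, int], Result]


def literal(s: str) -> Parser:
    """Match the exact string s at the current position."""
    def parse(text: str, pos: int) -> Result:
        if text.startswith(s, pos):
            return (s, pos + len(s))
        return None
    return parse


def sequence(*parsers: Parser) -> Parser:
    """Match every parser in order; fail if any one of them fails."""
    def parse(text: str, pos: int) -> Result:
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None  # the caller still holds its own start position
            value, pos = r
            values.append(value)
        return (values, pos)
    return parse


def choice(*parsers: Parser) -> Parser:
    """Prioritized choice: try alternatives in order, each from the same
    start position, and return the first one that succeeds."""
    def parse(text: str, pos: int) -> Result:
        for p in parsers:
            r = p(text, pos)  # backtracking: every alternative restarts at pos
            if r is not None:
                return r
        return None
    return parse


# Hypothetical toy grammar:  item <- "ab" "c" / "ab" "d"
item = choice(
    sequence(literal("ab"), literal("c")),
    sequence(literal("ab"), literal("d")),
)

if __name__ == "__main__":
    print(item("abd", 0))
    print(item("abx", 0))
```

On the input "abd" the first alternative consumes "ab", fails on "c", and the choice retries the second alternative from position 0. Because the choice is prioritized, an earlier alternative that succeeds always wins, even when a later one would consume more input.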