Synthesizing data extraction scripts from user examples
University: New York University - Abu Dhabi
Information extraction approaches can be broadly classified into rule-based and machine learning approaches. Annotators or extractors built with these techniques typically target the far ends of the spectrum of data extraction tasks. Rule-based techniques focus on well-structured text, such as directory listings, machine-generated logs, and web pages, and build parsers that can extract most of the data with great accuracy. In contrast, machine learning techniques combined with natural language text analysis tools extract information from free-form text, such as online product reviews, comments, or posts, to determine sentiment and discussion subjects, e.g., which product is being discussed. In the middle of this spectrum are collections of semi-structured text: collections of resumes or research papers, biographical dictionaries, historical military journals, historical narratives, and so on. These collections have sufficient structure, within and across documents, to be well expressed with rule-based extractors. However, enough discrepancies exist across a collection that manually building extractors is not cost-effective, while NLP and machine learning approaches tend to miss most of the inherent structure entirely.

Building on Gulwani et al.’s FlashExtract work on synthesizing extractors for well-structured documents, we build Texture, a system that extracts data from semi-structured collections. Texture differs from existing work in the following ways:
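To illustrate why hand-written rules struggle on semi-structured collections, consider a minimal sketch (the record layouts below are invented for illustration): a regex tuned to one layout silently drops records that carry the same fields in a slightly different form.

```python
import re

# A rule tuned to one observed layout: "Name: <name>, Year: <year>".
pattern = re.compile(r"Name:\s*(?P<name>[A-Za-z ]+),\s*Year:\s*(?P<year>\d{4})")

records = [
    "Name: Ada Lovelace, Year: 1815",  # matches the expected layout
    "Name: Alan Turing, Year: 1912",   # matches
    "Alan Perlis (b. 1922)",           # same fields, different layout
]

extracted = [m.groupdict() for r in records if (m := pattern.search(r))]
print(extracted)
# The third record is silently skipped: the rule captures only 2 of 3 entries,
# even though a human reader sees the same structure in all three.
```

Scaling such rules to cover every layout variant in a large collection is exactly the manual effort that is not cost-effective.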
1) Its structure specification language uses higher-level constructs than traditional regex patterns, relying on user-constructed dictionaries and named-entity recognizers (NERs).
2) It introduces new pattern expressions that account for variance across a collection.
3) It supports extract-to-structure semantics, where the target structure can be relational, nested, or graph-shaped.
4) It provides a mixed-initiative user interface.
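The flavor of a higher-level pattern (point 1 above) can be sketched as follows. This is a hypothetical illustration, not Texture's actual syntax: the dictionary, the stand-in NER check, and the token window are all invented here. The pattern refers to a user-built dictionary and an entity tag rather than raw character classes, and tolerates layout variance by skipping filler tokens.

```python
from dataclasses import dataclass

DEGREES = {"PhD", "MSc", "BSc"}  # user-constructed dictionary

@dataclass
class Token:
    text: str

def looks_like_year(tok: Token) -> bool:
    # Stand-in for an NER tag such as DATE/YEAR.
    return tok.text.isdigit() and len(tok.text) == 4

def extract(tokens: list[Token]) -> list[dict]:
    """Match: <dictionary entry> ... <year entity>, skipping filler tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.text in DEGREES:
            # Scan a short window ahead, tolerating variance across documents.
            for nxt in tokens[i + 1 : i + 4]:
                if looks_like_year(nxt):
                    out.append({"degree": tok.text, "year": nxt.text})
                    break
    return out

docs = [
    "PhD 2004",             # terse layout
    "MSc awarded in 1998",  # same fields, extra filler tokens
]
for d in docs:
    print(extract([Token(t) for t in d.split()]))
# [{'degree': 'PhD', 'year': '2004'}]
# [{'degree': 'MSc', 'year': '1998'}]
```

Both layouts yield the same record, which a single character-level regex would need separate alternatives to cover.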