For VirtualWorks, information extraction is the art of applying thousands of semantically organized local grammars to extract facts about people, events, organizations, keywords, sentiments, and more from texts.
Historically, the first information extraction systems were in fact applications of local grammar technologies to the extraction of specific semantic units like dates or geographical entities. (More details can be found in the publications mentioned in the Background to VirtualWorks section.)
The VirtualWorks approach distinguishes between two kinds of extraction grammars: grammars that characterize “argument” structures in texts, covering expressions that denote persons, organizations, geographical entities, dates and intervals, measures, and many other types; and grammars for “propositional forms” (also known as “predicate-argument structures”). Both kinds of grammar are semantically typed, so combining arguments with propositional forms is a smooth merging process: extracted facts are structured as detailed propositional forms whose properties are known in advance and whose slots are filled by semantically typed entities, giving rise to billions of observed facts. VirtualWorks has developed a unique toolbox for the finite-state parsing of texts, based on dedicated local grammars and specialized dictionaries, with which tens of millions of words can be processed accurately in one second on a single machine!
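The merging of typed argument grammars with propositional forms can be illustrated with a small sketch. Everything here is hypothetical: the grammar names, patterns, and the FOUNDED frame are illustrative stand-ins, and plain regular expressions stand in for the compiled finite-state transducers a real local-grammar system would use.

```python
import re

# Hypothetical mini "argument grammars": each maps a semantic type to a
# finite-state pattern (a regex here stands in for a compiled transducer).
ARGUMENT_GRAMMARS = {
    "PERSON": r"(?:Dr\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+",
    "ORG":    r"[A-Z][a-z]+\s(?:Inc\.|Corp\.|University)",
    "DATE":   r"(?:19|20)\d{2}",
}

# A propositional form with semantically typed slots:
# FOUNDED(agent: PERSON, theme: ORG, time: DATE).
# The predicate pattern references the argument grammars by type, which is
# what makes the combination of the two grammar kinds a smooth merge.
FOUNDED = re.compile(
    rf"(?P<agent>{ARGUMENT_GRAMMARS['PERSON']})\sfounded\s"
    rf"(?P<theme>{ARGUMENT_GRAMMARS['ORG']})\sin\s"
    rf"(?P<time>{ARGUMENT_GRAMMARS['DATE']})"
)

def extract_founded(text):
    """Return all FOUNDED(agent, theme, time) facts found in the text."""
    return [m.groupdict() for m in FOUNDED.finditer(text)]

facts = extract_founded("Ada Smith founded Acme Corp. in 1999.")
```

Because the slots of the propositional form are typed, any improvement to an argument grammar (say, a richer PERSON pattern) immediately benefits every predicate pattern that references it.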
An example of a relatively complex extraction task is the detection of all facts mentioned in texts that express where and when a particular person went to school to obtain a specific degree in a specific discipline. With VirtualWorks’ finite-state parser, it takes less than one minute to find more than 600,000 instances of such sentences in the English Wikipedia!
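A single surface pattern for such education facts might be sketched as follows. This is a toy illustration, not VirtualWorks’ actual grammar: a real system would deploy many local grammars per frame to cover the variety of phrasings, and the pattern, slot names, and example sentence below are all assumptions.

```python
import re

# Hypothetical sketch of one education-fact grammar: who obtained which
# degree, in which discipline, from which school, and when.
EDUCATION = re.compile(
    r"(?P<person>[A-Z][a-z]+\s[A-Z][a-z]+)\s"
    r"(?:earned|received|obtained)\sa\s"
    r"(?P<degree>B\.A\.|M\.Sc\.|Ph\.D\.)\sin\s"
    r"(?P<discipline>[a-z]+(?:\s[a-z]+)?)\sfrom\s"
    r"(?P<school>(?:the\s)?University\sof\s[A-Z][a-z]+)"
    r"(?:\sin\s(?P<year>(?:19|20)\d{2}))?"
)

sentence = ("Jane Doe received a Ph.D. in linguistics "
            "from the University of Paris in 2004.")
fact = EDUCATION.search(sentence).groupdict()
```

A production-scale run would compile thousands of such grammars into a single finite-state machine and sweep it over the corpus in one pass, which is what makes corpus-wide queries like the Wikipedia example feasible in minutes.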