Entity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction
With the advent of the Web, textual information has grown at an explosive rate. To digest this enormous amount of data, an automatic solution, Information Extraction (IE), has become necessary. Information extraction is the task of converting unstructured text strings into structured, machine-readable data. The first key step of a general IE pipeline is often to analyze the entities mentioned in the text before drawing holistic conclusions. To fully understand an entity, one needs to detect its mentions, categorize it into semantic types, connect it with its knowledge base entry, and identify its attributes as well as its relationships with other entities. In this dissertation, we first present the problem of fine-grained entity recognition. Unlike most traditional named entity recognition systems, which use a small set of entity classes (e.g., person, organization, location, or miscellaneous), we define a novel set of over one hundred fine-grained entity types. Determining the semantic classes of entities mentioned in unstructured text more precisely is useful for understanding text intelligently and extracting a wide range of information. We formulate the recognition problem as multi-class, multi-label classification, describe an unsupervised method for collecting training data, and present the FIGER implementation. Next, we demonstrate that fine-grained entity types are closely connected with other entity analysis tasks. We describe an entity linking system whose predictions rely heavily on these types and present a simple yet effective implementation, called VINCULUM. An extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the-art systems, elucidates key aspects of the system, including mention extraction, candidate generation, entity type prediction, entity coreference, and coherence. Finally, we describe an approach to acquiring commonsense knowledge from a massive amount of text on the Web. In particular, we develop a system called SIZEITALL to extract numerical attribute values for various classes of entities. To resolve the ambiguity of surface-form text, we canonicalize the extractions with respect to WordNet senses and build a knowledge base of physical sizes for thousands of entity classes. Across all three entity analysis tasks, we show that it is feasible to build sophisticated IE systems without a significant investment of human effort in creating labeled data.