Information Extraction from Semi-Structured Websites

Lockard, Colin

Files

Lockard_washington_0250E_22994.pdf (2.32 MB)

Date

2021-08-26

Abstract

The World Wide Web contains countless semi-structured websites, which present information via text embedded in rich layout and visual features. These websites can be a source of information for populating knowledge bases if the facts they present can be extracted and transformed into a structured form, a goal that researchers have pursued for over twenty years. A fundamental opportunity and challenge of extracting from these sources is the variety of signals that can be harnessed to learn an extraction model, from textual semantics to layout semantics to page-to-page consistency of formatting. Extraction from semi-structured sources has been explored by researchers from the natural language processing, data mining, and database communities, but most of this work uses only a subset of the available signals, which limits its ability to scale to the large number and variety of such sites on the Web. In this thesis, we address this problem with a line of research that advances the state of semi-structured extraction by taking advantage of existing knowledge bases and by using modern machine learning methods to build rich representations of the textual, layout, and visual semantics of webpages. We present a suite of methods that enable information extraction from semi-structured sources, addressing scenarios that include both closed and open domain information extraction and varying levels of prior knowledge about a subject domain.