Information Extraction from Semi-Structured Websites

Lockard, Colin

dc.contributor.advisor	Hajishirzi, Hannaneh
dc.contributor.author	Lockard, Colin
dc.date.accessioned	2021-08-26T18:08:37Z
dc.date.available	2021-08-26T18:08:37Z
dc.date.issued	2021-08-26
dc.date.submitted	2021
dc.description	Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract	The World Wide Web contains countless semi-structured websites, which present information via text embedded in rich layout and visual features. These websites can be a source of information for populating knowledge bases if the facts they present can be extracted and transformed into a structured form, a goal that researchers have pursued for over twenty years. A fundamental opportunity and challenge of extracting from these sources is the variety of signals that can be harnessed to learn an extraction model, from textual semantics to layout semantics to page-to-page consistency of formatting. Extraction from semi-structured sources has been explored by researchers from the natural language processing, data mining, and database communities, but most of this work uses only a subset of the available signals, which limits its ability to scale to the large number and variety of such sites on the Web. In this thesis, we address this problem with a line of research that advances the state of semi-structured extraction by taking advantage of existing knowledge bases, as well as using modern machine learning methods to build rich representations of the textual, layout, and visual semantics of webpages. We present a suite of methods that enable information extraction from semi-structured sources, addressing scenarios that include both closed and open domain information extraction and varying levels of prior knowledge about a subject domain.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Lockard_washington_0250E_22994.pdf
dc.identifier.uri	http://hdl.handle.net/1773/47422
dc.language.iso	en_US
dc.rights	none
dc.subject	information extraction
dc.subject	natural language processing
dc.subject	Artificial intelligence
dc.subject	Computer science
dc.subject.other	Computer science and engineering
dc.title	Information Extraction from Semi-Structured Websites
dc.type	Thesis

Files

Original bundle

Name: Lockard_washington_0250E_22994.pdf
Size: 2.32 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering