Information Extraction from Semi-Structured Websites

dc.contributor.advisor: Hajishirzi, Hannaneh
dc.contributor.author: Lockard, Colin
dc.date.accessioned: 2021-08-26T18:08:37Z
dc.date.available: 2021-08-26T18:08:37Z
dc.date.issued: 2021-08-26
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: The World Wide Web contains countless semi-structured websites, which present information via text embedded in rich layout and visual features. These websites can be a source of information for populating knowledge bases if the facts they present can be extracted and transformed into a structured form, a goal that researchers have pursued for over twenty years. A fundamental opportunity and challenge of extracting from these sources is the variety of signals that can be harnessed to learn an extraction model, from textual semantics to layout semantics to page-to-page consistency of formatting. Extraction from semi-structured sources has been explored by researchers from the natural language processing, data mining, and database communities, but most of this work uses only a subset of the signals available, limiting its ability to scale to the large number and variety of such sites on the Web. In this thesis, we address this problem with a line of research that advances the state of semi-structured extraction by taking advantage of existing knowledge bases, as well as using modern machine learning methods to build rich representations of the textual, layout, and visual semantics of webpages. We present a suite of methods that will enable information extraction from semi-structured sources, addressing scenarios that include both closed and open domain information extraction and varying levels of prior knowledge about a subject domain.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Lockard_washington_0250E_22994.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47422
dc.language.iso: en_US
dc.rights: none
dc.subject: information extraction
dc.subject: natural language processing
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Information Extraction from Semi-Structured Websites
dc.type: Thesis

Files

Original bundle

Name: Lockard_washington_0250E_22994.pdf
Size: 2.32 MB
Format: Adobe Portable Document Format