
BIBLIOGRAPHY

Publications, working papers, and other research using data resources from IPUMS.

Full Citation

Title: Summarizing Semi-Structured Data

Citation Type: Dissertation/Thesis

Publication Year: 2020

Abstract: Data sources such as Yelp or Twitter that produce records without a pre-defined schema (semi-structured data) are now popular. However, semi-structured data can be unwieldy for data exploration because of its extremely large number of records and its accumulating number of attributes. Finding a more compact data representation (i.e., a summary encoding) is thus critical before subsequent data analysis can be applied. Designing an appropriate summary encoding requires a trade-off between compactness and fidelity. In this thesis, a framework is developed for reasoning about this trade-off, and a measure of summary fidelity is proposed accordingly. Analytic and experimental evidence is presented showing that the proposed fidelity measure is not only efficiently computable but also a meaningful measure of summary quality. To efficiently construct a high-fidelity summary encoding under a limit on its verbosity, a clustering-based approach is identified; experimental results show that it runs orders of magnitude faster than, and produces results competitive with, more powerful techniques for compression and summarization.

Url: https://www.proquest.com/docview/2417290006?pq-origsite=gscholar&fromopenview=true

User Submitted?: No

Authors: Xie, Ting

Institution: State University of New York at Buffalo

Department: Computer Science and Engineering

Advisor:

Degree:

Publisher Location:

Pages:

Data Collections: IPUMS USA

Topics: Methodology and Data Collection

Countries:
