Full Citation
Title: Summarizing Semi-Structured Data
Citation Type: Dissertation/Thesis
Publication Year: 2020
ISBN:
ISSN:
DOI:
NSFID:
PMCID:
PMID:
Abstract: Data sources such as Yelp or Twitter that produce records without a pre-defined schema (semi-structured data) are now common. However, semi-structured data can be unwieldy for data exploration because of the extremely large number of records and the growing number of attributes. Finding a more compact data representation (i.e., a summary encoding) is thus critical before subsequent data analysis can be applied. Designing an appropriate summary encoding requires a trade-off between compactness and fidelity. In this thesis, a framework is developed for reasoning about this trade-off, and a measure of summary fidelity is proposed accordingly. Analytical and experimental evidence is presented showing that the proposed fidelity measure is not only efficiently computable but also a meaningful measure of summary quality. To efficiently construct a high-fidelity summary encoding under a limit on its verbosity, a clustering-based approach is identified; experimental results show that it runs orders of magnitude faster than, and produces results competitive with, more powerful techniques for compression and summarization.
Url: https://www.proquest.com/docview/2417290006?pq-origsite=gscholar&fromopenview=true
User Submitted?: No
Authors: Xie, Ting
Institution: State University of New York at Buffalo
Department: Computer Science and Engineering
Advisor:
Degree:
Publisher Location:
Pages:
Data Collections: IPUMS USA
Topics: Methodology and Data Collection
Countries: