Full Citation
Title: Efficient fusion of aggregated historical data
Citation Type: Miscellaneous
Publication Year: 2017
ISBN:
ISSN:
DOI:
NSFID:
PMCID:
PMID:
Abstract: Background. In this paper, we address the challenge of recovering a time sequence of counts from aggregated historical data. For example, given a mixture of monthly and weekly sums, how can we find the daily counts of people infected with flu? In general, what is the best way to recover historical counts from aggregated, possibly overlapping historical reports in the presence of missing values? Equally important, how much should we trust this reconstruction? Current methods fail to handle complex cases such as missing values and conflicting or overlapping reports. Our method not only handles these cases but also recovers the time sequence with higher accuracy by incorporating domain knowledge. Aim. In this project, we are particularly interested in one question: how can we recover historical events from aggregated and overlapping historical reports? That is, suppose we are interested in an unknown time sequence x = {x1, x2, ..., xn} (daily observations of a certain event). Given several aggregated reports (monthly sums, yearly sums), how can we reconstruct the original sequence from them? Data. Our data come from Project Tycho, a project at the University of Pittsburgh to advance the availability and use of public health data for science and policy making. Currently, the Project Tycho database includes data from all weekly notifiable disease reports for the United States. It dates back to 1888 and covers all US states. The diseases covered include measles and smallpox, among others. Method. We provide H-FUSE, a novel method that solves the above problems by allowing the injection of domain knowledge in a principled way, turning the task into a well-defined optimization problem that uses regularization strategies based on knowledge of the historical events, such as smoothness and periodicity.
H-FUSE has the following desirable properties: (a) Effectiveness, recovering historical data from aggregated reports with high accuracy; (b) Self-awareness, providing an assessment of when the recovery is not reliable; (c) Scalability, with computation linear in the size of the input data. Results. Experiments on real data (epidemiology counts from the Tycho project) demonstrate that H-FUSE reconstructs the original data 30–80% better than the least squares method. Conclusions. We develop a way to recover a time sequence from its partial sums by formulating it as an optimization problem with various constraints that allow the injection of domain knowledge. Our work extends the previous pseudo-inverse method and will provide a new way to reconstruct time series from historical data with faster performance and higher accuracy.
Url: https://www.ml.cmu.edu/research/dap-papers/S17/dap-liu-zongge.pdf
User Submitted?: No
Authors: Liu, Zongge
Publisher: SIAM
Data Collections: IPUMS NHGIS
Topics: Methodology and Data Collection, Population Data Science
Countries:
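
Illustrative sketch: The abstract describes recovering a daily sequence from overlapping aggregated sums by adding a smoothness penalty to a least-squares fit. The abstract does not give the H-FUSE objective, so the following is only a minimal sketch under stated assumptions: a squared first-difference penalty, a hypothetical weight `lam`, hypothetical helper names, and synthetic data rather than Tycho counts. It is compared with a plain pseudo-inverse (minimum-norm) reconstruction.

```python
import numpy as np


def aggregation_matrix(n, windows):
    """Build A so that (A @ x)[i] is the sum of x over the half-open range windows[i]."""
    A = np.zeros((len(windows), n))
    for row, (start, end) in enumerate(windows):
        A[row, start:end] = 1.0
    return A


def first_difference_matrix(n):
    """Build D so that (D @ x)[t] = x[t + 1] - x[t]."""
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D


def reconstruct_smooth(A, y, lam=1.0):
    """Minimize ||A x - y||^2 + lam * ||D x||^2 (assumed smoothness penalty)."""
    n = A.shape[1]
    D = first_difference_matrix(n)
    # Stacking the penalty rows under A turns the problem into ordinary least squares.
    stacked_A = np.vstack([A, np.sqrt(lam) * D])
    stacked_y = np.concatenate([y, np.zeros(n - 1)])
    x_hat, *_ = np.linalg.lstsq(stacked_A, stacked_y, rcond=None)
    return x_hat


def reconstruct_min_norm(A, y):
    """Baseline: minimum-norm least-squares solution via the pseudo-inverse."""
    return np.linalg.pinv(A) @ y


def roughness(x):
    """Sum of squared day-to-day changes."""
    return float(np.sum(np.diff(x) ** 2))


def rmse(x, x_ref):
    return float(np.sqrt(np.mean((x - x_ref) ** 2)))


# Synthetic example (not real data): 28 days of a smooth signal, with four
# weekly reports, one two-week report, and one report that overlaps the others.
n = 28
t = np.arange(n)
x_true = 10.0 + 5.0 * np.sin(2 * np.pi * t / n)
windows = [(0, 7), (7, 14), (14, 21), (21, 28), (0, 14), (10, 24)]
A = aggregation_matrix(n, windows)
y = A @ x_true

x_smooth = reconstruct_smooth(A, y, lam=1.0)
x_pinv = reconstruct_min_norm(A, y)

print("RMSE, smoothness-regularized:", rmse(x_smooth, x_true))
print("RMSE, pseudo-inverse:        ", rmse(x_pinv, x_true))
```

In this sketch the stacked least-squares form is used only because it keeps the example short. A periodicity penalty, as mentioned in the abstract, could be added in the same way by stacking further rows, but its exact form is not specified here.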