
BIBLIOGRAPHY

Publications, working papers, and other research using data resources from IPUMS.

Full Citation

Title: Cleaning Data with Forbidden Itemsets

Citation Type: Miscellaneous

Publication Year: 2016

Abstract: Methods for cleaning dirty data typically rely on additional information about the data, such as user-specified constraints that specify when a database is dirty. These constraints often involve domain restrictions and illegal value combinations. Traditionally, a database is considered clean if all constraints are satisfied. However, many real-world scenarios only have a dirty database available. In such a context, we adopt a dynamic notion of data quality, in which the data is clean if an error discovery algorithm does not find any errors. We introduce forbidden itemsets, which capture unlikely value co-occurrences in dirty data, and we derive properties of the lift measure to provide an efficient algorithm for mining low-lift forbidden itemsets. We further introduce a repair method which guarantees that the repaired database does not contain any low-lift forbidden itemsets. The algorithm uses nearest neighbour imputation to suggest possible repairs. Optional user interaction can easily be integrated into the proposed cleaning method. Evaluation on real-world data shows that errors are typically discovered with high precision, while the suggested repairs are of good quality and do not introduce new forbidden itemsets, as desired.
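Illustration: the idea of flagging unlikely value co-occurrences by lift can be sketched as follows. This is a minimal sketch and not the authors' algorithm. It assumes the standard pairwise lift from association rule mining, lift(a, b) = P(a, b) / (P(a) P(b)), which may differ from the definition used in the paper, and it only considers pairs of values. The function names, the toy records, and the threshold of 0.5 are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_lift(rows):
    """Map each co-occurring pair of (attribute, value) items to its lift.

    Assumes the standard pairwise lift: P(a, b) / (P(a) * P(b)).
    """
    n = len(rows)
    single = Counter()  # support count of each (attribute, value) item
    pair = Counter()    # support count of each pair of items
    for row in rows:
        items = sorted(row.items())  # fixed order so each pair has one key
        single.update(items)
        pair.update(combinations(items, 2))
    # (c / n) / ((s_a / n) * (s_b / n)) simplifies to n * c / (s_a * s_b)
    return {p: n * c / (single[p[0]] * single[p[1]]) for p, c in pair.items()}


def low_lift_pairs(rows, threshold):
    """Return the pairs whose lift falls below the (hypothetical) threshold."""
    return {p: v for p, v in pairwise_lift(rows).items() if v < threshold}


# Toy records (invented for illustration): one record has an unlikely combination.
rows = (
    [{"sex": "F", "relation": "wife"}] * 4
    + [{"sex": "M", "relation": "husband"}] * 5
    + [{"sex": "M", "relation": "wife"}]
)

for items, value in low_lift_pairs(rows, threshold=0.5).items():
    print(items, round(value, 3))
```

A pair with lift well below 1 occurs far less often than it would if the two values were independent, which is what makes it a candidate error. The mining algorithm and the repair step described in the abstract are not reproduced here.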

Url: http://adrem.uantwerpen.be/sites/adrem.uantwerpen.be/files/ForbiddenItemsetsICDE.pdf

User Submitted?: No

Authors: Rammelaere, Joeri; Geerts, Floris; Goethals, Bart

Publisher: University of Antwerp

Data Collections: IPUMS USA

Topics: Other

Countries:
