Full Citation
Title: Cleaning Data with Forbidden Itemsets
Citation Type: Miscellaneous
Publication Year: 2016
ISBN:
ISSN:
DOI:
NSFID:
PMCID:
PMID:
Abstract: Methods for cleaning dirty data typically rely on additional information about the data, such as user-specified constraints that specify when a database is dirty. These constraints often involve domain restrictions and illegal value combinations. Traditionally, a database is considered clean if all constraints are satisfied. However, many real-world scenarios only have a dirty database available. In such a context, we adopt a dynamic notion of data quality, in which the data is clean if an error discovery algorithm does not find any errors. We introduce forbidden itemsets which capture unlikely value co-occurrences in dirty data, and we derive properties of the lift measure to provide an efficient algorithm for mining low lift forbidden itemsets. We further introduce a repair method which guarantees that the repaired database does not contain any low lift forbidden itemsets. The algorithm uses nearest neighbour imputation to suggest possible repairs. Optional user interaction can easily be integrated into the proposed cleaning method. Evaluation on real-world data shows that errors are typically discovered with high precision, while the suggested repairs are of good quality and do not introduce new forbidden itemsets, as desired.
Url: http://adrem.uantwerpen.be/sites/adrem.uantwerpen.be/files/ForbiddenItemsetsICDE.pdf
User Submitted?: No
Authors: Rammelaere, Joeri; Geerts, Floris; Goethals, Bart
Publisher: University of Antwerp
Data Collections: IPUMS USA
Topics: Other
Countries: