BIBLIOGRAPHY

Publications, working papers, and other research using data resources from IPUMS.

Full Citation

Title: Entity Resolution with Limited Training Data

Citation Type: Dissertation/Thesis

Publication Year: 2023

Abstract: The most powerful solutions to today’s problems can be identified when data are analysed to tell their own stories. This potential can be better exploited when data originating from heterogeneous domains are linked together to improve data quality and facilitate effective decision-making in data mining applications. Entity resolution, the process of linking records that refer to the same entity across one or more data sets, has delivered proven advances in domains ranging from health research, social sciences and genealogy, national censuses, and online shopping to crime and fraud detection and prevention. For example, linking the birth, marriage, and death certificates of all individuals in a population to generate family pedigrees can help the health sector with the early identification of hereditary diseases. In recent years, supervised learning methods have moved to the forefront of entity resolution research because of the high linkage quality that can be obtained with sufficient labelled data. A recent study has, however, found that a random forest based classifier required at least 1.5 million training labels to link two fairly clean data sets of consumer products with a precision and recall of 99%. The process of obtaining such labelled data is often manual and requires extensive human effort. Therefore, the practical difficulties of applying supervised learning methods create the need to consider alternatives that work with limited labelled training data. In this thesis, we start with a background study and a comprehensive literature review to understand current gaps and research directions. We then propose two main frameworks for addressing the problem of limited training data in entity resolution. The first is a solution based on transfer learning, which allows labelled data available in a semantically related domain to be exploited for the classification task of entity resolution.
The novelty of our framework is that it can provide improved linkage quality for short structured data, whereas existing transfer learning frameworks for entity resolution are based on deep learning models that provide better results only for unstructured textual data. The second framework is an unsupervised method that does not require any labelled data. Existing unsupervised methods only consider data sets containing basic entities that have static attribute values and static relationships, such as publications in bibliographic data sets. These methods cannot achieve high linkage quality on data sets with complex entities, where an entity (such as a person) can change its attribute values over time and have different relationships with other entities at different points in time. Therefore, we propose an unsupervised graph-based entity resolution framework aimed at linking records of complex entities. We then propose a novel method to geocode historical addresses to support our unsupervised entity resolution framework. Finally, we describe two application case studies that demonstrate the practical benefits of our frameworks.

Url: https://openresearch-repository.anu.edu.au/bitstream/1885/288743/1/Thesis_Nishadi_Kirielle_2023.pdf

User Submitted?: No

Authors: Kirielle, Nishadi

Institution: Australian National University

Department:

Advisor:

Degree:

Publisher Location:

Pages: 1-198

Data Collections: IPUMS USA

Topics: Methodology and Data Collection

Countries:
