Full Citation
Title: Top-Coding and Public Use Microdata Samples from the U.S. Census Bureau
Citation Type: Journal Article
Publication Year: 2014
ISBN:
ISSN:
DOI:
NSFID:
PMCID:
PMID:
Abstract: The US Census Bureau regularly releases Public Use Microdata Samples (PUMS), datafiles which contain de-identified subsets of the data provided by respondents to some of its various surveys and to the Decennial Census itself. This allows data users to perform “micro”-analyses rather than the “macro”-tabulations which are regularly performed by the Bureau. These data users range from non-government (say, university) researchers to government policymakers. These micro-analyses typically depend on the joint distribution of two or more variables over individuals or households. As a very simple example, think of the relationship of wages of individuals to their individual ages by a linear regression equation. We will use this very simple example throughout this paper to illustrate the effects we are interested in. In order to protect the privacy of the data supplied by respondents, as required by Title 13 U.S.C., the Bureau uses a variety of methods to modify the data so that it is very difficult for data users to identify individual respondents. Although some kind of privacy protection measures are necessary by law, most of them (top-coding, in particular) have a detrimental effect on the micro-analyses because application of these privacy protection measures changes the interdependence of two or more variables and, in many cases, renders the analyses moot. This paper is a very brief review of Census Bureau privacy protection methods and a small exploration of the effect that top-coding, in particular, has on some specific micro-analyses. Throughout this document: a) we have focussed on the American Community Survey (ACS) PUMS because it is one of the richest national datasets; and b) we have used Alaska and California as example states because they have, respectively, very small and very large populations and because the age distributions and wage distributions are quite different between the two states.
We have performed each of our analyses for every state and the results for the other states are available in the supplementary materials. In Section 2 we discuss privacy protection methods in a little more detail and, in particular, focus on a detailed understanding of top-coding, as currently used by the Census Bureau. In Section 3 we give a brief description of the data sets that we used in our attempts to correct top-coding in Section 4. We introduce the Health and Retirement Study, a non-Census Bureau survey, as a potential tool for correcting the effect of top-coding. In Section 4 we describe the various correction approaches we tried, why they failed, and why there appear to be no other viable approaches to restoring the distributional properties (e.g., the correlation) of pairs of variables, at least one of which has been top-coded. Section 6 discusses the errors in the Census PUMS discovered by [1] and the fix provided by the Census Bureau, and some additional errors we discovered in the Minnesota Population Center IPUMS. Finally, in Section 7 we briefly discuss the implications of our study for statistical and economic analyses based on PUMS data which have been top-coded.
Url: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/639/622
User Submitted?: No
Authors: Crimi, Nicole; Eddy, William F.
Periodical (Full): Journal of Privacy and Confidentiality
Issue: 2
Volume: 6
Pages: 21-58
Data Collections: IPUMS USA
Topics: Other
Countries: United States
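
Illustrative sketch: The abstract's running example, a linear regression of individual wages on age, can be tried on synthetic data. The snippet below is a minimal illustration and not the authors' method. The data are simulated, the function names (`simulate`, `top_code`, `ols_slope`, `pearson`) are hypothetical, and the top-coding rule (replace each value above a cutoff with the mean of the values above that cutoff) is an assumed simplification. The cutoff used here is the simulated 90th percentile, not any Census Bureau threshold.

```python
import math
import random


def simulate(n=5000, seed=0):
    """Synthetic ages and right-skewed wages that rise with age (assumed model)."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 65) for _ in range(n)]
    # Wage grows with age, with multiplicative lognormal noise for a heavy upper tail.
    wages = [1000.0 * age * rng.lognormvariate(0.0, 0.6) for age in ages]
    return ages, wages


def top_code(values, cutoff):
    """Assumed rule: replace every value above `cutoff` with the mean of those values."""
    above = [v for v in values if v > cutoff]
    if not above:
        return list(values)
    fill = sum(above) / len(above)
    return [fill if v > cutoff else v for v in values]


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson(x, y):
    """Pearson correlation coefficient of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    ages, wages = simulate()
    cutoff = sorted(wages)[int(0.9 * len(wages))]  # simulated 90th percentile
    coded = top_code(wages, cutoff)
    print("slope, raw wages:       ", ols_slope(ages, wages))
    print("slope, top-coded wages: ", ols_slope(ages, coded))
    print("correlation, raw:       ", pearson(ages, wages))
    print("correlation, top-coded: ", pearson(ages, coded))
```

Under the assumed rule, the overall mean wage is unchanged because the upper tail is replaced by its own mean, but all variation within the tail is removed. Comparing the printed slopes and correlations shows how a statistic that depends on the joint distribution of age and wage can shift even though the marginal mean does not.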