BIBLIOGRAPHY

Publications, working papers, and other research using data resources from IPUMS.

Full Citation

Title: Theoretical and Applied Problems in Partially Private Data

Citation Type: Dissertation/Thesis

Publication Year: 2023

Abstract: Research in statistical data privacy (SDP) has traditionally self-organized into two disjoint schools of thought: statistical disclosure limitation (SDL) and formal privacy (FP). Both perspectives rely on different units of analysis, measures of disclosure risk, and adversarial assumptions. Yet in recent years, differential privacy (DP), a particular variant of FP, has emerged as the methodologically preferred perspective by analyzing release mechanisms and database schemas under the broadest possible adversarial assumptions. To do so, DP quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released without additional randomized noise. This includes many cases where DP mechanisms are implemented in a way that depends on the confidential data, such as by choosing the privacy loss parameter based on confidential data (or synthetic data highly correlated with the confidential data). Consequently, DP alone cannot characterize the privacy loss attributable to the entire joint collection of releases, nor the decisions made in implementing the mechanism. Such problems pose an existential threat to building DP systems in practice, one that DP alone cannot address. In this dissertation, we study the privacy and utility properties of “partially private data” (PPD), collections of statistics where only some are released through DP mechanisms. In particular, we define the random variable Z as “public information” not protected by DP. PPD is inherently statistical, as it relies on assumptions about the correlation structure between private and public information. We present a privacy formalism, (ϵ, {Θ_z}_{z∈Z})-Pufferfish (ϵ-TP for short when {Θ_z}_{z∈Z} is implied), a collection of Pufferfish mechanisms indexed by realizations of Z.
First, we prove that this definition has similar properties to DP. Next, we introduce two release mechanisms for publishing PPD satisfying ϵ-TP and prove their desirable properties. We additionally introduce perfect sampling algorithms to exactly implement these mechanisms, as well as approximate Bayesian computation algorithms for sampling from the posterior of a parameter given PPD. We then compare this inference approach to the alternative where noisy statistics are deterministically combined with Z. We derive mild conditions under which using our algorithms offers both theoretical and computational improvements over this more common approach. We demonstrate all the effects above on two case studies: one on COVID-19 data, and one on rural mortality data. Finally, we discuss the implications of all the above from a social and legal perspective, with the end goal of using PPD to make FP technologies more accessible to essential social science data curators.
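The abstract notes that DP quantifies privacy loss via noise injected into output statistics. A minimal sketch of that idea, assuming the standard Laplace mechanism (not a mechanism from this dissertation): noise with scale `sensitivity / epsilon` is added to a statistic, so smaller ϵ (stronger privacy) means more noise.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a statistic with Laplace noise calibrated to sensitivity/epsilon.

    Illustrative only: the standard DP Laplace mechanism, not a PPD mechanism.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of Laplace(0, scale) from a Uniform(-1/2, 1/2) draw
    u = rng.random() - 0.5
    noise = -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return true_value + noise

# Example: a noisy count with sensitivity 1 (one person changes the count by 1)
noisy_count = laplace_mechanism(120.0, sensitivity=1.0, epsilon=0.5,
                                rng=random.Random(42))
```

With the same random draw, raising ϵ shrinks the injected noise proportionally, which is the privacy/utility trade-off the abstract refers to.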

Url: https://etda.libraries.psu.edu/files/final_submissions/27694

User Submitted?: No

Authors: Seeman, Jeremy

Institution: The Pennsylvania State University

Department:

Advisor:

Degree:

Publisher Location:

Pages: 1-168

Data Collections: IPUMS USA

Topics: Methodology and Data Collection

Countries:
