Chapter 9 Anonymity and Confidentiality
Or, Statistical Disclosure Control, Masking, and De-Identification
9.1 Introduction
Many of the data sets that are ripe for analysis are covered by some form of legal, ethical, or moral constraint on confidentiality. How do we as researchers ensure that the data we work with, whether released in aggregated form or as micro (individual) records stripped of personal identifiers such as name, remain anonymous, thereby ensuring that any personal information remains private?
Many of the practical applications of statistical disclosure control can be traced back to the work of various statistical agencies around the world, who are confronted with the challenge of balancing increased expectations for openness and transparency (which translates into an increasing demand for the release of data) with “an increasing public consciousness about the privacy of individuals” (Bethlehem, 2009, p.342).
Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL) seeks to protect statistical data in such a way that they can be released without giving away confidential information that can be linked to specific individuals or entities. (Hundepool et al., 2012, p.1)
Because statistical agencies have this role and take the responsibility very seriously (their reputations rest heavily on the willingness of individuals, businesses, and other public agencies to provide data to them), much of the material on the topic comes directly or indirectly from those agencies (see, for example, the Australian National Statistical Service and Statistics Canada publications listed below).
A related topic is data masking, “the process of hiding original data with random characters or data.”
The main reason for applying masking to a data field is to protect data that is classified as personal identifiable data, personal sensitive data or commercially sensitive data, however the data must remain usable for the purposes of undertaking valid test cycles. It must also look real and appear consistent. (Data masking page at Wikipedia)
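A minimal sketch of this idea: replace each character of a sensitive field with a random character of the same class, so the masked value keeps the original's length and format and still "looks real" for testing. This is an illustrative toy (the `mask_field` name and sample value are invented here), not the method of any particular masking product.

```python
import random
import string

def mask_field(value: str) -> str:
    """Replace each character with a random one of the same class,
    preserving length, case pattern, digit positions, and punctuation."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators so the value stays format-consistent
    return "".join(out)

original = "Smith, Jane: 613-555-0171"
masked = mask_field(original)
```

Note that this preserves format but destroys linkability: the same input masks to a different output each run, which is fine for test data but unsuitable when records must still join across tables (keyed hashing, below under {digest}, handles that case).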
Sidebar: The Mosaic Effect
Often bandied about is the term “the mosaic effect”. A good definition comes from Vivek Kundra, who at the time was the U.S. Government’s Chief Information Officer:
Individual pieces of data when released independently may not reveal sensitive information but when combined, this “mosaic effect” could be used to derive personal information or information vital to national security. (“Testimony Resolving the Shroud of Secrecy”, 2010-03-23)
Further reading: “Sorry, your data can still be identified even if it’s anonymized”
Aggregation, masking, and other SDC methods should reduce (if not entirely eliminate) a “mosaic effect”. More research into this topic is required.
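The mechanics of a mosaic-effect attack can be sketched in a few lines: two releases, each apparently harmless on its own, are joined on shared quasi-identifiers. The toy data below (invented names and values, echoing the ZIP/birth year/sex quasi-identifiers made famous by Sweeney's Massachusetts re-identification) shows how a "de-identified" health file combines with a public roll to recover identities.

```python
# Two separate releases, each apparently harmless on its own.
health = [  # "de-identified": names removed, quasi-identifiers kept
    {"zip": "02138", "birth_year": 1954, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1971, "sex": "M", "diagnosis": "asthma"},
]
public_roll = [  # a public list with names and the same quasi-identifiers
    {"name": "J. Doe", "zip": "02138", "birth_year": 1954, "sex": "F"},
    {"name": "R. Roe", "zip": "02139", "birth_year": 1971, "sex": "M"},
]

def link(health, roll):
    """Join the two releases on the shared quasi-identifiers."""
    matches = []
    for h in health:
        for v in roll:
            if all(h[k] == v[k] for k in ("zip", "birth_year", "sex")):
                matches.append((v["name"], h["diagnosis"]))
    return matches
```

When each quasi-identifier combination is unique in both files, as here, the join attaches a name to every diagnosis; k-anonymity (section 9.6.2) is one formal defence against exactly this.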
9.2 Laws, Ethics, and Morals
The Canadian Panel on Research Ethics has created a policy document that includes a chapter on Privacy and Confidentiality, covering the full range of concerns about the use of individual data.
9.3 Discussion
Some definitions: “What is the difference between ‘de-identified’, ‘anonymous’, and ‘coded’ data?”
Justin Brickell and Vitaly Shmatikov, 2008: “The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing”
- see also Wikipedia: De-anonymization
Dary Hsu, “Techniques to Anonymize Human Data”, 2015
Gregory J. Matthews, Pétala Gardênia da Silva Estrela Tuy, and Robert K. Arthur, “An examination of statistical disclosure issues related to publication of aggregate statistics in the presence of a known subset of the dataset using Baseball Hall of Fame ballots”, Journal of Quantitative Analysis in Sports, 2017
Arvind Narayanan and Vitaly Shmatikov, 2008: “Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)”
Arvind Narayanan and Vitaly Shmatikov, 2008: “Robust De-anonymization of Large Sparse Datasets”
see also Bruce Schneier, “Why ‘Anonymous’ Data Sometimes Isn’t”
see also Arvind Narayanan, “Big Data: Anonymity, Privacy, Ethics”, a collection of papers and other resources on the topic
Latanya Sweeney, “Information Explosion” in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies (Sweeney (2001))
Latanya Sweeney, Michael von Loewenfeldt, and Melissa Perry, “Saying it’s Anonymous Doesn’t Make It So: Re-identifications of ‘anonymized’ law school data” in Journal of Technology Science (Sweeney et al. (2018))
- additional publications on the topic of data anonymization and re-identification by Latanya Sweeney can be found as part of her CV at latanyasweeney.org
Pete Warden, “Why you can’t really anonymize your data”, 2011
9.4 Statistical agencies: principles and practices
9.4.1 Australia
National Statistical Service (Australia), “How to confidentialise data: the basic principles”
National Statistical Service (Australia), “Managing the risk of disclosure in the release of microdata”
9.4.2 Canada
Statistics Canada, “Disclosure control” (part of Statistics Canada Quality Guidelines, Catalogue 12-539-X)
Statistics Canada, 2011: A New Approach for the Development of a Public Use Microdata File for Canada’s 2011 National Household Survey, Catalogue no. 99-137-XWE2015001.
Mark Stinner, 2017, “Disclosure Control and Random Tabular Adjustment”, SSC Annual Meeting, June 2017. {PDF}
9.4.3 United Kingdom
Government Statistical Service (United Kingdom), 2009: National Statistician’s Guidance: Confidentiality of Official Statistics
Government Statistical Service (United Kingdom): “Statistical Disclosure Control”
Office for National Statistics (United Kingdom), 2004: National Statistics Code of Practice: Protocol on Data Access and Confidentiality
9.4.4 United States of America
Bureau of Labor Statistics (BLS) Disclosure Review Board (U.S.A.), 2012: Balancing Confidentiality Requirements with Data Users’ Information Needs
Census Bureau: Statistical Disclosure Control (SDC)
- “Links and conventional references to research sponsored by the U.S. Census Bureau in the areas of statistical disclosure control, confidentiality, and disclosure limitation.”
John Abowd, Tweetorial: Reconstruction-abetted re-identification attacks and other traditional vulnerabilities {unrolled twitter thread}
9.5 Other resources
9.5.1 Health agencies: “de-identification”
U.S. Department of Health & Human Services, Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule
Committee on Strategies for Responsible Sharing of Clinical Trial Data; Board on Health Sciences Policy; Institute of Medicine. Washington (DC): National Academies Press (US); 2015 Apr 20. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk: Appendix B: Concepts and Methods for De-identifying Clinical Trial Data
9.5.2 others
Information and Privacy Commissioner of Ontario, June 2016, De-identification Guidelines for Structured Data
Simson L. Garfinkel, October 2015, De-Identification of Personal Information, NISTIR 8053, National Institute of Standards and Technology.
Jelke Bethlehem, 2009: Applied Survey Methods: A Statistical Perspective, Wiley (Wiley Series in Survey Methodology)
- The 13th and final chapter of this comprehensive outline of survey research methods is titled “Statistical Disclosure Control”. While the book focusses on data collected through sample surveys, the principles apply equally to census surveys and administrative records.
Children’s Hospital of Eastern Ontario Research Institute, 2007: Pan-Canadian De-Identification Guidelines for Personal Health Information
Josep Domingo-Ferrer, 2014, “Data Anonymization: A Tutorial”
Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, Peter-Paul de Wolf, 2012: Statistical Disclosure Control, Wiley (Wiley Series in Survey Methodology)
- Essentially a handbook of methods to ensure the privacy of individuals, it covers both microdata (individual records) and tabular data, recognizing the inherent differences in the risks associated with each.
Leon Willenborg and Ton de Waal, 1996: Statistical Disclosure Control in Practice, Springer (Lecture Notes in Statistics 111).
Sharon Hewitt, “Making Test Data Realistic – Without Taking It from Production” (iri.com)
9.6 Specific methods
9.6.1 cell perturbation
Jean-Louis Tambay, June 2017: “A layered perturbation method for the protection of tabular outputs”, Survey Methodology (12-001-X), Statistics Canada, vol. 43, no. 1.
Gwenda Thompson, Stephen Broadfoot, Daniel Elazar, 2013: “Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics”, Joint UNECE/Eurostat work session on statistical data confidentiality, Ottawa, Canada, 28-30 October 2013.
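One simple member of this family of methods is unbiased random rounding of frequency-table cells: each count is rounded to a multiple of a base (commonly 3), with probabilities chosen so the expected value equals the true count. The sketch below is a generic illustration of that principle, not the specific layered or automated schemes described in the papers above.

```python
import random

def random_round(count: int, base: int = 3, rng: random.Random = None) -> int:
    """Randomly round a cell count to a multiple of `base`, rounding up
    with probability remainder/base so the result is unbiased:
    E[random_round(count)] == count."""
    rng = rng or random.Random()
    remainder = count % base
    if remainder == 0:
        return count  # already a multiple of the base; publish as-is
    if rng.random() < remainder / base:
        return count + (base - remainder)  # round up
    return count - remainder               # round down
```

A cell of 7 is thus published as 9 one third of the time and 6 two thirds of the time; a snooper cannot tell whether a published 6 hides a 5, 6, or 7, yet totals computed over many cells remain approximately correct.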
9.6.2 k-anonymity
Latanya Sweeney, “k-Anonymity: A Model for Protecting Privacy”, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002: 557–570.
Arvind Narayanan and Vitaly Shmatikov, “Robust De-anonymization of Large Sparse Datasets”
Roberto J. Bayardo and Rakesh Agrawal, “Data Privacy Through Optimal k-Anonymization”
Khaled El Emam and Fida Kamal Dankar, “Protecting Privacy Using k-Anonymity”, J Am Med Inform Assoc. 2008 Sep-Oct; 15(5): 627–637.
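The core check behind Sweeney's definition is easy to state in code: a table is k-anonymous over a set of quasi-identifiers if every combination of quasi-identifier values appears at least k times. A minimal sketch (the function name, column names, and generalized toy rows are invented here):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k for which the table is k-anonymous: the size of the
    smallest equivalence class over the quasi-identifier columns."""
    classes = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(classes.values())

# Rows already generalized: ZIP truncated, age banded.
rows = [
    {"zip": "021**", "age": "40-49", "condition": "flu"},
    {"zip": "021**", "age": "40-49", "condition": "cancer"},
    {"zip": "021**", "age": "50-59", "condition": "flu"},
    {"zip": "021**", "age": "50-59", "condition": "flu"},
]
```

Here `k_anonymity(rows, ["zip", "age"])` is 2: each (zip, age) class contains two people, so a linking attack narrows any target to no fewer than two records. Note the last two rows also illustrate k-anonymity's known weakness (addressed by l-diversity): both members of the 50-59 class share the same condition, so the sensitive value leaks anyway.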
9.6.3 pq-Rule
Meghan O’Malley, Lawrence R. Ernst, “Practical Considerations in Applying the pq-Rule for Primary Disclosure Suppressions”, December 2007.
9.7 R
John Mount (2012) “Modeling Trick: Masked Variables”
9.7.1 {sdcMicro}
{sdcMicro} is the best SDC tool I have found in the R space, so I’ve given it top billing rather than my usual alphabetical listing.
package
CRAN page: sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation
github page: sdcMicro (note: this is a sub-page for a broader set of SDC tools)
articles
Statistical Disclosure Control (SDCMicro) at International Household Survey Network.
Matthias Templ, “Statistical Disclosure Control for Microdata Using the R-Package sdcMicro”, Transactions on Data Privacy 1 (2008) 67–85.
Matthias Templ, Bernhard Meindl and Alexander Kowarik, Introduction to Statistical Disclosure Control (SDC), 2015/2017
Matthias Templ, Alexander Kowarik, Bernhard Meindl (2015) “Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro”, Journal of Statistical Software, Vol.67 No.4
Alexander Kowarik and Matthias Templ, “Make Your Data Confidential with the sdcMicro and sdcMicroGUI packages”, 2012
Daniel Abril, Guillermo Navarro-Arribas and Vicenç Torra (2015) “Data Privacy with R”, researchgate.net, 2015-07-10.
9.7.2 {sdcMicroGUI}
package
CRAN page: sdcMicroGUI: Graphical User Interface for Package ‘sdcMicro’
articles
Matthias Templ, Bernhard Meindl and Alexander Kowarik, Tutorial for sdcMicroGUI (and sdcMicro), 2015
9.7.3 {sdcTable}
CRAN page: sdcTable: Methods for Statistical Disclosure Control in Tabular Data
9.7.4 {easySdcTable}
package
CRAN page: easySdcTable: Easy Interface to the Statistical Disclosure Control Package ‘sdcTable’
vignette: easySdcTable
9.7.5 {digest}
package
CRAN page: digest: Create Compact Hash Digests of R Objects
articles
Jan Górecki (2014), “Data anonymization in R” (alternate source)
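The pseudonymization pattern that hash digests support, replacing a direct identifier with a keyed hash so records can still be linked across tables without exposing the identifier, can be sketched language-independently. A Python illustration using a keyed HMAC rather than a plain hash (a plain unsalted hash of, say, a patient number can be reversed by brute-forcing the small space of possible inputs); the function name and key are invented for this example:

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA-256).
    The same identifier always maps to the same pseudonym, preserving
    linkability, but without the key the mapping cannot be recomputed
    or brute-forced from the released file."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

key = b"keep-this-key-out-of-the-released-file"
pid = pseudonymize("patient-00123", key)
```

The key must be stored separately from (and never released with) the pseudonymized data; destroying the key converts the pseudonyms into effectively irreversible tokens.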
9.7.6 {obfuscateR}
(Note: this package has not been submitted to CRAN, and development appears to have stalled)
github page: PedramNavid/obfuscateR
9.8 Other tools
Benjamin Bengfort, “A Practical Guide to Anonymizing Datasets with Python & Faker: How Not to Lose Friends and Alienate People”, 2016
-30-