Efficient k-anonymity Using Clustering Technique

Get BibTex-formatted data

Download

PDF

Author

Ji-Won Byun and Ashish Kamra and Elisa Bertino and Ninghui Li

Tech report number

CERIAS TR 2006-10

Entry type

techreport

Abstract

k-anonymization techniques are a key component of any comprehensive solution to data privacy and have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications such as generalization and suppression. Current solutions, however, suffer from one or more of the following limitations: reliance on pre-defined generalization hierarchies; generation of anonymized data with high information loss and with high classification errors; and the inference channel arising from lack of diversity in the sensitive information. In this paper we propose an approach that addresses these limitations. Our approach uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation here is that data records that are naturally close with respect to each other should be part of the same equivalence class. Current clustering techniques, however, are not directly applicable in this context because they do not consider the requirement that each cluster should contain at least k records. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy algorithm, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data. We also present extensions to our proposed algorithm that minimize classification errors in the anonymized data and eliminate the inference channel arising from lack of diversity in the sensitive attributes.

Download

PDF

Institution

Purdue University

Key alpha

byun

Affiliation

CERIAS

Publication Date

1900-01-01

Subject

Privacy, Anonymity, Data sharing, Clustering

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Techreport{ byun,
	title = "Efficient k-anonymity Using Clustering Technique",
	author = "Ji-Won Byun and Ashish Kamra and Elisa Bertino and Ninghui Li",
	institution = "Purdue University",
	abstract = "k-anonymization techniques are a key component of any
comprehensive solution to data privacy and have been the focus of
intense research in the last few years. An important requirement for
such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications such as generalization and suppression. Current solutions, however, suffer from one or more of the following limitations: reliance on pre-defined generalization hierarchies; generation of anonymized data with high information loss and with high classification errors; and the inference channel arising from
lack of diversity in the sensitive information.
In this paper we propose an approach that addresses these limitations. Our approach uses the idea of clustering to minimize information loss and thus ensure good data quality. The key
observation here is that data records that are naturally close with
respect to each other should be part of the same equivalence class. Current clustering techniques, however, are not directly applicable
in this context because they do not consider the requirement that
each cluster should contain at least k records. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy algorithm, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data. We also present extensions to our proposed algorithm that minimize classification
errors in the anonymized data and eliminate the inference channel
arising from lack of diversity in the sensitive attributes.",
	affiliation = "CERIAS",
	subject = "Privacy, Anonymity, Data sharing, Clustering",
}