Abstract
k-anonymization techniques are a key component of any
comprehensive solution to data privacy and have been the focus of
intense research in the last few years. An important requirement for
such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications such as generalization and suppression. Current solutions, however, suffer from one or more of the following limitations: reliance on pre-defined generalization hierarchies; generation of anonymized data with high information loss and with high classification errors; and the inference channel arising from
lack of diversity in the sensitive information.
In this paper we propose an approach that addresses these limitations. Our approach uses the idea of clustering to minimize information loss and thus ensure good data quality. The key
observation here is that data records that are naturally close with
respect to each other should be part of the same equivalence class. Current clustering techniques, however, are not directly applicable
in this context because they do not consider the requirement that
each cluster should contain at least k records. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy algorithm, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data. We also present extensions to our proposed algorithm that minimize classification
errors in the anonymized data and eliminate the inference channel
arising from lack of diversity in the sensitive attributes.
Subject
Privacy, Anonymity, Data sharing, Clustering