Abstract
Maximizing data utility and minimizing privacy risk are two conflicting goals. Before releasing data, organizations typically hide the owners' identities and then apply a set of transformations to the data. While determining the best set of transformations has been the focus of extensive work in the database community, most of this work suffers from one or both of two major problems: poor scalability and the lack of a privacy guarantee. To the best of our knowledge, none of the proposed scalable anonymization techniques provides privacy guarantees supported by a well-formulated theoretical foundation. Differential privacy provides such a theoretical formulation: it ensures that the system behaves essentially the same way regardless of whether any individual, or small group of individuals, is included in the database.
In this paper, we address both the scalability and the privacy risk of data anonymization. We propose a scalable algorithm that satisfies differential privacy when combined with a specific random-sampling step. The contribution of the paper is three-fold: (1) we prove that determining the optimal set of transformations is an NP-hard problem and propose a heuristic approximation based on genetic algorithms; (2) we propose a personalized anonymization technique based on a Lagrangian formulation and prove that it can be solved in polynomial time; and (3) we prove that a variant of the proposed Lagrangian technique, with a specific sampling scheme, satisfies differential privacy.
Through experimental studies, we compare our proposed algorithm with other anonymization schemes in terms of both time and privacy risk, and we show that the proposed algorithm is scalable. Moreover, we compare the performance of the proposed approximate algorithm with that of the optimal algorithm and show that the sacrifice in risk is outweighed by the gain in efficiency.