Abstract
Although data disclosure is advantageous for many reasons, it carries the risk of privacy breaches. One such violation occurs when an adversary reconstructs the original data using auxiliary information. Moreover, sharing private information such as addresses and telephone numbers on social networks is always subject to potential misuse. In this dissertation, we address both the scalability and the privacy risk of data anonymization. We develop a framework that quantifies the relationship between the disclosed data and the resulting privacy risk, and we use it to determine the optimal set of transformations to perform before the data is disclosed.
We also propose a scalable algorithm that satisfies differential privacy when combined with a specific random-sampling step.
The main contribution of this dissertation is threefold: (i) we show that determining the optimal transformations is an NP-hard problem and propose several approximation heuristics, which we justify experimentally; (ii) we propose a personalized anonymization technique based on an aggregate (Lagrangian) formulation and prove that it can be solved in polynomial time; and (iii) we show that combining the proposed aggregate formulation with a specific sampling scheme yields an anonymization algorithm that satisfies differential privacy. Our results rely heavily on the supermodularity properties of the risk function, which allow us to employ techniques from convex optimization. Finally, we use the proposed model to assess the risk of sharing private information on social networks.
Through experimental studies, we compare our proposed algorithms with other anonymization schemes in terms of both running time and privacy risk, and we show that the proposed algorithm is scalable. We also compare the performance of the approximate algorithms with that of the optimal algorithm and show that the sacrifice in risk is outweighed by the gain in efficiency.