Abstract
In the information age, vast amounts of sensitive personal information are collected by companies, institutions and governments. A key technological challenge is how to design mechanisms for effectively extracting knowledge from data while preserving the privacy of the individuals involved. In this dissertation, we address this challenge from the perspective of differentially private data publishing. Firstly, we propose PrivPfC, a differentially private method for releasing data for classification. The key idea underlying PrivPfC is to privately select, in a single step, a grid, which partitions the data domain into a number of cells. This selection is done using the exponential mechanism with a novel quality function, which maximizes the expected number of correctly classified records by a histogram classifier. PrivPfC supports both the binary classification as well as the multiclass classification. Secondly, we study the problem of differentially private k-means clustering. We develop techniques to analyze the empirical error behaviors of the existing interactive and non-interactive approaches. Based on the analysis, we propose an improvement of the DPLloyd algorithm which is a differentially private version of the Lloyd algorithm and propose a non-interactive approach EUGkM which publishes a differentially private synopsis for k-means clustering. We also propose a hybrid approach that combines the advantages of the improved version of DPLloyd and EUGkM. Finally, we investigate the sparse vector technique (SVT) which is a fundamental technique for satisfying differential privacy in answering a sequence of queries. We propose a new version of SVT that provides better utility by introducing an effective technique to improve the performance of SVT in the interactive setting. We also show that in the non-interactive setting (but not the interactive setting), usage of SVT can be replaced by the exponential mechanism.