Statistical Learning Theory Approach for Data Classification with l-diversity
Primary Investigator:
Chris Clifton
Chris Clifton and Koray Mancuhan
Abstract
Corporations are retaining ever-larger corpuses of personal
data; the frequency of breaches and corresponding privacy
impact have been rising accordingly. One way to mitigate
this risk is through use of anonymized data, limiting the
exposure of individual data to only where it is absolutely
needed. This would seem particularly appropriate for data
mining, where the goal is generalizable knowledge rather
than data on specific individuals. In practice, corporate
data miners often insist on original data, for fear that they
might "miss something" with anonymized or differentially
private approaches. This paper provides a theoretical
justification for the use of anonymized data. Specifically, we
show that a support vector classifier trained on anatomized
data satisfying l-diversity should be expected to perform as
well as one trained on the original data. Anatomy preserves all data
values, but introduces uncertainty in the mapping between
identifying and sensitive values, thus satisfying l-diversity.
The theoretical effectiveness of the proposed approach is
validated using several publicly available datasets, showing
that we outperform the state of the art for support vector
classification using training data protected by k-anonymity,
and are comparable to learning on the original data.
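
To make the idea of anatomization concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each record is a (quasi-identifiers, sensitive value) pair and uses a simple greedy grouping in which every group contains l records with l distinct sensitive values. The function name `anatomize`, the record layout, and the choice to drop leftover records that cannot fill a group are all illustrative assumptions.

```python
import random
from collections import defaultdict


def anatomize(records, l, seed=0):
    """Split records into a quasi-identifier table and a sensitive table.

    records: list of (quasi_identifiers, sensitive_value) pairs (assumed layout).
    Returns (qi_table, sensitive_table), where
      qi_table        holds (quasi_identifiers, group_id) rows and
      sensitive_table holds (group_id, sensitive_value) rows.
    Every group has l records with l distinct sensitive values. Records that
    cannot complete a group are dropped in this simplified sketch.
    """
    rng = random.Random(seed)

    # Bucket the quasi-identifier tuples by their sensitive value.
    buckets = defaultdict(list)
    for qi, sensitive in records:
        buckets[sensitive].append(qi)

    qi_table, sensitive_table = [], []
    group_id = 0
    while True:
        # Sensitive values that still have records, largest bucket first.
        remaining = sorted(
            (s for s in buckets if buckets[s]),
            key=lambda s: len(buckets[s]),
            reverse=True,
        )
        if len(remaining) < l:
            break  # not enough distinct sensitive values for another group

        group_values = []
        for s in remaining[:l]:
            qi_table.append((buckets[s].pop(), group_id))
            group_values.append(s)

        # Shuffle so row order in the sensitive table does not mirror the
        # row order in the quasi-identifier table.
        rng.shuffle(group_values)
        sensitive_table.extend((group_id, s) for s in group_values)
        group_id += 1

    return qi_table, sensitive_table


# Toy data (hypothetical): (age, zip code) as quasi-identifiers, disease as sensitive value.
example = [
    ((34, "47906"), "flu"),
    ((51, "47907"), "flu"),
    ((29, "47906"), "asthma"),
    ((45, "47901"), "asthma"),
    ((62, "47907"), "diabetes"),
    ((38, "47901"), "diabetes"),
]
qi_table, sensitive_table = anatomize(example, l=2)
for row in qi_table:
    print("QI:", row)
for row in sensitive_table:
    print("ST:", row)
```

In this sketch all attribute values survive unchanged, but the two tables are linked only through the group identifier, so each individual in a group could be associated with any of the l distinct sensitive values in that group.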