Balancing between data utility and privacy preservation in data mining

Jain, Sachin Kumar and Tandon, Ankit (2010) Balancing between data utility and privacy preservation in data mining. BTech thesis.

Abstract

Data mining plays a vital role in today's information world, where it has been widely applied in various organizations. The current trend is to share data for mutual benefit. However, there has been a lot of concern over privacy in recent years, because releasing data publicly poses a potential threat of revealing sensitive data about individuals. Various methods, such as anonymization and perturbation, have been proposed to tackle the privacy preservation problem. But the natural consequence of privacy preservation is information loss. The loss of specific information about certain individuals may affect data quality, and in extreme cases the data may become completely useless. Methods such as cryptography completely anonymize the dataset and thereby render it useless, so the utility of the data is completely lost. We need to protect private information while preserving as much data utility as possible. The objective of this thesis is therefore to find an optimum balance between privacy and utility when publishing the dataset of any organization. Privacy preservation is a hard requirement that must be satisfied, and utility is the measure to be optimized.
One method for preserving privacy is k-anonymization, which preserves privacy to a good extent. K-anonymity demands that every tuple in the released dataset be indistinguishably related to no fewer than k respondents. We used the K-means algorithm to cluster the dataset, followed by k-anonymization. Decision stump classification is used to measure utility, and privacy is measured by firing random queries on the anonymized dataset. The balancing point is where the utility and privacy curves intersect or tend to converge. The balancing point will vary from dataset to dataset and with the choice of quasi-identifier and sensitive attribute. In our experiment the balancing point was found to be around 50-60 percent, which is where the privacy and utility curves intersect.
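
The k-anonymity condition described above can be illustrated with a minimal sketch. This is not the thesis implementation: it omits the K-means clustering and decision stump steps. The records, the attribute names (`age`, `zip`, `disease`), the choice of quasi-identifiers, and the helper functions `anonymity_level` and `generalize_age` are all hypothetical and chosen only for illustration. The sketch groups rows by their quasi-identifier values and reports the size of the smallest group, which is the largest k for which the table is k-anonymous.

```python
from collections import Counter


def anonymity_level(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    identical values on all quasi-identifier attributes."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalize_age(row, width):
    """Replace an exact age with a range of the given width
    (a hypothetical generalization step, assumed for illustration)."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


# Hypothetical toy records; "disease" plays the role of the sensitive attribute.
records = [
    {"age": 23, "zip": "769008", "disease": "flu"},
    {"age": 27, "zip": "769008", "disease": "asthma"},
    {"age": 31, "zip": "769008", "disease": "flu"},
    {"age": 38, "zip": "769008", "disease": "diabetes"},
]
quasi_identifiers = ["age", "zip"]

# Before generalization every row is unique on (age, zip).
raw_k = anonymity_level(records, quasi_identifiers)

# After generalizing age into 10-year ranges, rows fall into shared groups.
generalized = [generalize_age(r, 10) for r in records]
generalized_k = anonymity_level(generalized, quasi_identifiers)
```

In this toy data the raw table is only 1-anonymous, while the generalized table is 2-anonymous. Wider ranges would raise k further at the cost of coarser ages, which is the privacy and utility trade-off the abstract describes.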

Item Type:	Thesis (BTech)
Uncontrolled Keywords:	Data Mining, Balancing, Utility, Privacy, Generalization
Subjects:	Engineering and Technology > Computer and Information Science > Data Mining
Divisions:	Engineering and Technology > Department of Computer Science
ID Code:	1651
Deposited By:	Sachin Kumar Jain
Deposited On:	17 May 2010 15:28
Last Modified:	18 May 2010 09:42
Related URLs:	Dspace
Supervisor(s):	Jena, S K
