Sharma, Bikash and Jain, Aman (2007) Privacy preserving data mining. BTech thesis.
A fruitful direction for future data mining research will be the development of technique that incorporates privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We analyze the possibility of privacy in data mining techniques in two phasesrandomization and reconstruction. Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To preserve client privacy in the data mining process, techniques based on random perturbation of data records are used. Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. This approach is called randomization. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. For the concept of using value distortion to protect privacy to be useful, we need to be able to reconstruct the original data distribution so that data mining techniques can be effectively utilized to yield the required statistics.
Let xi be the original instance of data at client i. We introduce a random shift yi using randomization technique explained below. The server runs the reconstruction algorithm (also explained below) on the perturbed value zi = xi + yi to get an approximate of the original data distribution suitable for data mining applications. Randomization We have used the following randomizing operator for data perturbation: Given x, let R(x) be x+€ (mod 1001) where € is chosen uniformly at random in .
Reconstruction of discrete data set
P(X=x) = f X (x) ----Given
P(Y=y) = F y (y) ---Given
P (Z=z) = f Z (z) ---Given
f (X/Z) = P(X=x | Z=z)
= P(X=x, Z=z)/P (Z=z)
= P(X=x, X+Y=Z)/ f Z (z)
= P(X=x, Y=Z - X)/ f Z (z)
= P(X=x)*P(Y=Z-X)/ f Z (z)
= P(X=x)*P(Y=y)/ f Z (z)
In this project we have done two aspects of privacy preserving data mining. The first phase involves perturbing the original data set using ‘randomization operator’ techniques and the second phase deals with reconstructing the randomized data set using the proposed algorithm to get an approximate of the original data set. The performance metrics like percentage deviation, accuracy and privacy breaches were calculated. In this project we studied the technical feasibility of realizing privacy preserving data mining. The basic promise was that the sensitive values in a user’s record will be perturbed using a randomizing function and an approximate of the perturbed data set be recovered using reconstruction algorithm.
|Item Type:||Thesis (BTech)|
|Uncontrolled Keywords:||Data mining, Perturbed value|
|Subjects:||Engineering and Technology > Computer and Information Science|
|Divisions:||Engineering and Technology > Department of Computer Science|
|Deposited By:||Hemanta Biswal|
|Deposited On:||27 Jun 2012 09:50|
|Last Modified:||27 Jun 2012 09:50|
|Supervisor(s):||Turuk, A K|
Repository Staff Only: item control page