Utility-Based Privacy Preserving Data Publishing

Babu, Korra Sathya (2013) Utility-Based Privacy Preserving Data Publishing. PhD thesis.



Advances in data collection techniques and the need for automation have triggered the proliferation of huge amounts of data. This exponential increase in the collection of personal information has for some time represented a serious threat to privacy. Advances in technologies for data storage, data mining, machine learning, social networking and cloud computing have aggravated the problem further. Privacy is a fundamental right of every human being and needs to be preserved. As a counterbalance to these socio-technical transformations, most nations have both general policies on preserving privacy and specific legislation to control access to and use of data. Privacy preserving data publishing is the ability to control the dissemination and use of one's personal information.
Merely publishing (or sharing) original data in raw form results in identity disclosure through linkage attacks. To overcome linkage attacks, the techniques of statistical disclosure control are employed. One such approach is k-anonymity, which reduces data across a set of key variables to a set of classes. In a k-anonymized dataset each record is indistinguishable from at least k-1 others, meaning that an attacker cannot link the data records to population units with certainty, thus reducing the probability of disclosure. Algorithms that have been proposed to enforce k-anonymity include Samarati's algorithm and Sweeney's Datafly algorithm. Both of these algorithms adhere to full-domain generalization with global recoding. These methods involve a tradeoff between utility, computing time and information loss. A good privacy preserving technique should ensure a balance of utility and privacy while delivering good performance and an adequate level of uncertainty. In this thesis, we propose an improved greedy heuristic that maintains a balance between utility, privacy, computing time and information loss.
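The k-anonymity condition described above can be illustrated with a minimal sketch. It is not the thesis's greedy heuristic, nor Samarati's or Datafly's algorithm; it only checks the k-anonymity property and applies one hand-picked global recoding step. The field names (`zip`, `age_band`, `disease`) and the helper names are hypothetical and chosen for illustration.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(size >= k for size in sizes.values())

def generalize_zip(records, digits_kept):
    """Global recoding: mask every ZIP code to the same prefix length."""
    return [
        {**r, "zip": r["zip"][:digits_kept] + "*" * (len(r["zip"]) - digits_kept)}
        for r in records
    ]

# Hypothetical toy table; "disease" plays the role of the sensitive attribute.
records = [
    {"zip": "76901", "age_band": "20-29", "disease": "flu"},
    {"zip": "76902", "age_band": "20-29", "disease": "cold"},
    {"zip": "76811", "age_band": "30-39", "disease": "flu"},
    {"zip": "76812", "age_band": "30-39", "disease": "asthma"},
]
quasi_identifiers = ["zip", "age_band"]

raw_ok = is_k_anonymous(records, quasi_identifiers, 2)
generalized = generalize_zip(records, 3)
generalized_ok = is_k_anonymous(generalized, quasi_identifiers, 2)
```

In this toy table every raw record is unique on its quasi-identifiers, so the check fails for k = 2. After all ZIP codes are masked to three digits, the records fall into two classes of two records each and the check passes, at the cost of coarser ZIP information.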
Given a dataset and a value of k, the dataset can be transformed into a k-anonymous dataset by the above-mentioned schemes. One of the challenges is to find the best value of k for a given dataset. In this thesis, a scheme is proposed to determine that value.
The k-anonymity scheme suffers from the homogeneity attack. As a result, the l-diversity scheme was developed. It states that each equivalence class should contain at least l diverse values of the sensitive attribute. The l-diversity scheme suffers from the background knowledge attack. To address this problem, the t-closeness scheme was proposed. The t-closeness principle states that the distance between the distribution of a sensitive attribute in an equivalence class and its distribution in the whole table should not exceed a threshold t. The drawback of this scheme is that the distance metric used in constructing a table that satisfies t-closeness does not follow the characteristics of a distance measure. In this thesis, we have deployed an alternative distance metric, namely the Hellinger metric, for constructing a t-closeness table. The t-closeness scheme with this alternative distance metric performed better with respect to the discernibility metric and computing time.
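As a rough illustration of a t-closeness test based on the Hellinger distance, the sketch below uses the standard definition for discrete distributions, H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2 / 2), which is symmetric and bounded between 0 and 1. How the thesis builds the table around this metric is not described in the abstract, so the function names and the toy equivalence classes are assumptions made for illustration only.

```python
import math
from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value in a list of sensitive values."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in domain]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same domain."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

def satisfies_t_closeness(classes, t):
    """True if every class's sensitive-value distribution is within t of the table's."""
    all_values = [v for cls in classes for v in cls]
    domain = sorted(set(all_values))
    overall = distribution(all_values, domain)
    return all(hellinger(distribution(cls, domain), overall) <= t for cls in classes)

# Hypothetical equivalence classes, each listing its sensitive values.
skewed = [["flu", "flu"], ["cold", "cold"]]
balanced = [["flu", "cold"], ["flu", "cold"]]

skewed_ok = satisfies_t_closeness(skewed, 0.2)
balanced_ok = satisfies_t_closeness(balanced, 0.2)
```

The skewed partition places each disease in its own class, so each class's distribution is far from the table-wide one and the check fails for t = 0.2. The balanced partition mirrors the overall distribution in every class and passes.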
The k-anonymity, l-diversity and t-closeness schemes can be used to anonymize a dataset before publishing (releasing or sharing) it. This generally applies to a static environment. There are also data that need to be published in a dynamic environment; one such example is a social network. Anonymizing social networks poses great challenges. Solutions suggested to date do not consider the utility of the data while anonymizing. In this thesis, we propose a novel scheme that anonymizes users depending on their importance and takes utility into consideration. The importance of a node is decided by centrality and prestige measures. Hence, the utility and privacy of the users are balanced.
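The abstract does not say which centrality and prestige measures are used or how they are combined. As one possible reading, the sketch below computes the textbook degree centrality (normalized out-degree) and degree prestige (normalized in-degree) for a directed graph. The function name and the toy edge list are hypothetical, and the sketch assumes there are no self-loops or duplicate edges.

```python
def degree_centrality_and_prestige(edges):
    """Return (centrality, prestige) dicts for a directed graph given as (src, dst) pairs.

    Centrality is out-degree / (n - 1); prestige is in-degree / (n - 1).
    Assumes no self-loops and no duplicate edges.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    denominator = max(len(nodes) - 1, 1)  # avoid division by zero for a single node
    centrality = {node: out_degree[node] / denominator for node in nodes}
    prestige = {node: in_degree[node] / denominator for node in nodes}
    return centrality, prestige

# Hypothetical "follows" relation: (a, b) means user a follows user b.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
centrality, prestige = degree_centrality_and_prestige(edges)

# Rank users by prestige, highest first; ties are broken by name.
ranking = sorted(prestige, key=lambda node: (-prestige[node], node))
```

In this toy graph, user b is followed by every other user and therefore has the highest prestige. An importance-aware scheme could treat such a node differently from peripheral ones, although the exact policy is left to the thesis itself.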

Item Type: Thesis (PhD)
Uncontrolled Keywords: k-anonymity, l-diversity, t-closeness
Subjects: Engineering and Technology > Computer and Information Science > Data Mining
Divisions: Engineering and Technology > Department of Computer Science
ID Code: 5487
Deposited By: Hemanta Biswal
Deposited On: 21 Mar 2014 11:46
Last Modified: 21 Mar 2014 11:46
Supervisor(s): Jena, Sanjay Kumar
