Application of data mining techniques in bioinformatics

Singh , Amartya and Das , Kundan Kumar (2007) Application of data mining techniques in bioinformatics. BTech thesis.



With the widespread use of databases and the explosive growth in their sizes, there is a need to effectively utilize these massive volumes of data. This is where data mining comes in handy, as it scours the databases for extracting hidden patterns, finding hidden information, decision making and hypothesis testing. Bioinformatics, an upcoming field in today’s world, which involves use of large databases can be effectively searched through data mining techniques to derive useful rules. Based on the type of knowledge that is mined, data mining techniques [1] can be mainly classified into association rules, decision trees and clustering. Until recently, biology lacked the tools to analyze massive repositories of information such as the human genome database [3]. The data mining techniques are effectively used to extract meaningful relationships from these data.Data mining is especially used in microarray analysis which is used to study the activity of different cells under different conditions. Two algorithms under each mining techniques were implemented for a large database and
compared with each other.
1. Association Rule Mining: - (a) a priori (b) partition
2. Clustering: - (a) k-means (b) k-medoids
3. Classification Rule Mining:- Decision tree generation using (a) gini index (b) entropy value. Genetic algorithms were applied to association and classification techniques. Further, kmeans and Density Based Spatial Clustering of Applications of Noise (DBSCAN) clustering techniques [1] were applied to a microarray dataset and compared. The microarray dataset was downloaded from internet using the Gene Array Analyzer Software(GAAS).The clustering was done on the basis of the signal color intensity of the genes in the microarray experiment. The following results were obtained:-
1. Association:- For smaller databases, the a priori algorithm works better than partition algorithm and for larger databases partition works better.
2. Clustering:- With respect to the number of interchanges, k-medoids algorithm works better than k-means algorithm.
3. Classification:- The results were similar for both the indices (gini index and entropy value). The application of genetic algorithm improved the efficiency of the association and classification techniques. For the microarray dataset, it was found that DBSCAN is less efficient than k-means when the database is small but for larger database DBSCAN is more accurate and efficient in terms of no. of clusters and time of execution. DBSCAN execution time increases linearly with the increase in database and was much lesser than that of k-means for larger database. Owing to the involvement of large datasets and the need to derive results from them, data mining techniques can be effectively put in use in the field of Bio-informatics [2]. The techniques can be applied to find associations among the genes, cluster similar gene and protein sequences and draw decision trees to classify the genes. Further, the data mining techniques can be made more efficient by applying genetic algorithms which greatly improves the search procedure and reduces the execution time.

Item Type:Thesis (BTech)
Uncontrolled Keywords:Data mining techniques, Bioinformatics
Subjects:Engineering and Technology > Computer and Information Science > Data Mining
Divisions: Engineering and Technology > Department of Computer Science
ID Code:4154
Deposited By:Hemanta Biswal
Deposited On:19 Jun 2012 11:32
Last Modified:27 Jun 2012 15:15
Supervisor(s):Rath, S K

Repository Staff Only: item control page