Protein Superfamily Classification using Computational Intelligence Techniques

Vipsita, Swati (2014) Protein Superfamily Classification using Computational Intelligence Techniques. PhD thesis.

[img]PDF
2114Kb

Abstract

The problem of protein superfamily classification is a challenging research area in Bioinformatics and has its major application in drug discovery. If a newly discovered protein which is responsible for the cause of new disease
gets correctly classified to its superfamily, then the task of the drug analyst becomes much easier. The analyst can perform molecular docking to find the correct relative orientation of ligand for the protein. The ligand database
can be searched for all possible orientations and conformations of the protein belonging to that superfamily paired with the ligand. Thus, the search space
is reduced enormously as the protein-ligand pair is searched for a particular protein superfamily. Therefore, correct classification of proteins becomes a very
challenging task as it guides the analysts to discover appropriate drugs. In this thesis, Neural Networks (NN), Multiobjective Genetic Algorithm (MOGA),and Support Vector Machine (SVM) are applied to perform the classification
task.Adaptive MultiObjective Genetic Algorithm (AMOGA), which is a variation of MOGA is implemented for the structure optimization of Radial Basis Function Network (RBFN). The modification to MOGA is done based on the
two key controlling parameters such as probability of crossover and probability of mutation. These values are adaptively varied based upon the performance of
the algorithm, i.e., based upon the percentage of the total population present in the best non-domination level. The problem of finding the number of hidden centers remains a critical issue for the design of RBFN. The most optimal RBF
network with good generalization ability can be derived from the pareto optimal set. Therefore, every solution of the pareto optimal set gives information regarding the specific samples to be chosen as hidden centers as well as the update weight matrix connecting the hidden and output layer. Principal Component Analysis (PCA) has been used for dimension reduction and significant feature extraction from long feature vector of amino acid sequences.In two-stage approach for protein superfamily classification, feature extraction process is carried in the first stage and design of the classifier has been proposed in the second stage with an overall objective to maximize the performance
accuracy of the classifier. In the feature extraction phase, Genetic Algorithm(GA) based wrapper approach is used to select few eigen vectors from the PCA space which are encoded as binary strings in the chromosome. Using
PCA-NSGA-II (non-dominated sorting GA), the non-dominated solutions obtained from the pareto front solves the trade-off problem by compromising between the number of eigen vectors selected and the accuracy obtained by the
classifier. In the second stage, Recursive Orthogonal Least Square Algorithm (ROLSA) is used for training RBFN. ROLSA selects the optimal number of

Item Type:Thesis (PhD)
Uncontrolled Keywords:MOGA, pareto front, non-domination level, probabilities of crossover and mutation, n-gram feature extraction, orthogonal least square algorithm,eigen vector, kernel function, hyper-parameters
Subjects:Engineering and Technology > Computer and Information Science
Divisions: Engineering and Technology > Department of Computer Science
ID Code:5655
Deposited By:Hemanta Biswal
Deposited On:30 Jul 2014 14:04
Last Modified:30 Jul 2014 14:04
Supervisor(s):Rath, S K

Repository Staff Only: item control page