Design of Feature Selection Techniques and Classifiers for Big Data analysis

Ray, Ransingh Biswajit (2016) Design of Feature Selection Techniques and Classifiers for Big Data analysis. MTech thesis.

PDF (Full text is restricted up to 26.04.2020)
Restricted to Repository staff only



The volume of knowledge and data has grown exponentially over the last decade or so. Previously, data entry operators were hired to enter and upload useful information; now, with the advent of social networks and mobile internet, users upload semi-structured and unstructured data directly to the web. Each day, data on the order of exabytes is added to the World Wide Web. Traditional data mining techniques face huge computational and analytical challenges when handling data at this scale, because they are not adapted to the new space and time requirements. Cluster computing frameworks such as MapReduce, Spark, and MPI are steadily gaining importance for processing large-scale data, and reducing execution time has become a main focus in this era of Big Data. Distributed computing is a method of processing large, complex data using various parallel processing paradigms. The MapReduce framework of Hadoop is one such cluster computing framework; it has become well known for fast processing of Big Data, but it does not deliver the same performance for many iterative learning algorithms. A newer cluster computing framework, Spark, offers scalability and fault tolerance similar to MapReduce and can run up to 100 times faster than Hadoop. We propose Spark-based scalable statistical tests (ANOVA, Kruskal-Wallis, and Friedman), information gain variants (Mutual Information and Mutual Information Feature Selection), and Spark-based scalable classifiers (SVM, Naive Bayes, Logistic Regression, and Linear Regression) for rapid processing and prediction on large datasets.
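
As an illustration of why a statistical test such as ANOVA suits a cluster computing framework, the sketch below computes a one-way ANOVA F-statistic for a single feature from per-partition summaries that can be merged in any order. It is a minimal, single-machine sketch in plain Python and is not the implementation from the thesis: the function names (partition_stats, merge_stats, anova_f) and the toy data are assumptions made for this example, and the map and reduce steps only stand in for what a framework like Spark would run across worker nodes.

```python
from collections import defaultdict
from functools import reduce


def partition_stats(rows):
    """Map step: summarise one partition of (class_label, feature_value) pairs
    as per-class [count, sum, sum of squares]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for label, value in rows:
        entry = stats[label]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    return dict(stats)


def merge_stats(a, b):
    """Reduce step: combine two partition summaries without mutating either."""
    merged = {label: list(values) for label, values in a.items()}
    for label, (n, s, ss) in b.items():
        entry = merged.setdefault(label, [0, 0.0, 0.0])
        entry[0] += n
        entry[1] += s
        entry[2] += ss
    return merged


def anova_f(stats):
    """One-way ANOVA F-statistic from the merged per-class summaries.
    Assumes at least two classes and non-zero within-class variance."""
    k = len(stats)
    n_total = sum(n for n, _, _ in stats.values())
    grand_mean = sum(s for _, s, _ in stats.values()) / n_total
    ss_between = sum(n * (s / n - grand_mean) ** 2 for n, s, _ in stats.values())
    ss_within = sum(ss - s * s / n for n, s, ss in stats.values())
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical toy data: two partitions of (class_label, feature_value) pairs.
partitions = [
    [("a", 1.0), ("a", 2.0), ("b", 4.0)],
    [("a", 3.0), ("b", 5.0), ("b", 6.0)],
]
merged = reduce(merge_stats, map(partition_stats, partitions))
f_value = anova_f(merged)  # 13.5 for this toy data
```

Because each partition is reduced to a few numbers per class before anything is combined, only those small summaries need to move between nodes. Features can then be ranked by their F-statistic, with higher values indicating stronger separation between classes.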

Item Type: Thesis (MTech)
Uncontrolled Keywords: Big Data; ANOVA; Kruskal-Wallis; Friedman; MI; MIFS; SVM; Naïve Bayes; Logistic Regression; Linear Regression; Spark; Hadoop; HDFS; RDD
Subjects: Engineering and Technology > Computer and Information Science > Data Mining
Engineering and Technology > Computer and Information Science
Divisions: Engineering and Technology > Department of Computer Science
ID Code: 9317
Deposited By: Mr. Sanat Kumar Behera
Deposited On: 27 Apr 2018 20:38
Last Modified: 27 Apr 2018 20:38
Supervisor(s): Rath, Santanu Kumar
