Analysis of Intrusion Data using Scalable Machine Learning Techniques

Sahu, Santosh Kumar (2022) Analysis of Intrusion Data using Scalable Machine Learning Techniques. PhD thesis.

PDF (Restricted up to 05/12/2024)
Restricted to Repository staff only
19MB

Abstract

This thesis presents our work on the design and implementation of intrusion detection systems (IDS) using scalable machine learning approaches. Designing a machine learning-based IDS requires a large number of threat signatures. Hence, our first work focuses on preparing a new intrusion dataset with the latest threat signatures. A testbed was created in our lab to launch the attacks, capture the attack patterns, and store them in packet capture (PCAP) format. The most crucial and tedious tasks are generating the features, preprocessing them, assigning the class labels, and making the data compatible with data analysis. To implement the detection engine, we start with signature-based detection approaches using Snort and BroIDS. Rules were written for many attacks by analyzing their signatures and were tested in our laboratory. This approach is well suited to known attacks. Its main disadvantage is that it cannot detect new intrusions, because it only looks for patterns that match the rules stored in the database. The database needs frequent updates as new attacks are discovered, and human intervention is required for attack detection, signature generation, rule generation, and distribution. To overcome these disadvantages, the anomaly-based approach is implemented using unsupervised and supervised techniques. Unsupervised anomaly detection takes unlabeled data and predicts a class label for each object based on the characteristics of the data distribution. Because of this property, no labeled training data is needed. This method is suitable for finding new attacks without prior knowledge about them. However, it raises false alarms because, in some cases, malicious activity does not differ significantly from regular activity.
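As an illustration of the unsupervised scoring described above, the following minimal sketch computes a Local Outlier Factor for each record. It assumes that the keyword "LoF" refers to the Local Outlier Factor; the sample records, their two features, and the choice k = 2 are hypothetical, and the sketch simplifies the standard definition by using exactly k neighbors and ignoring distance ties. It is not the implementation used in the thesis.

```python
import math


def local_outlier_factors(points, k=2):
    """Return a simplified Local Outlier Factor score for every point."""
    n = len(points)
    # Pairwise Euclidean distances between all records.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to its neighbor j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach if mean_reach > 0 else float("inf"))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical, already scaled records: five similar ones and one far away.
records = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (5.0, 5.0)]
scores = local_outlier_factors(records, k=2)
most_anomalous = max(range(len(records)), key=lambda i: scores[i])
```

Scores close to 1 indicate a record whose local density matches that of its neighbors, while a score well above 1 marks a candidate anomaly. No labels are used at any point, which is the property the abstract relies on.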
Further, to improve the detection process, the ensemble approach plays a vital role in constructing a robust detection engine and achieves better results than the earlier techniques. It divides the problem among the learners that participate in the training process and then combines their outputs. Data analysts apply the ensemble approach together with popular machine learning, pattern recognition, data mining, neural network, and measurement techniques to solve various problems. It is well known that an ensemble is typically more accurate than an individual learner, and ensemble strategies have made impressive progress in different real-world tasks. The ensemble approach can be built on supervised as well as unsupervised learning algorithms. In the ensemble method, a number of weak learners are trained, their learning errors are compared, and the data distribution is rearranged by multiplying it by a weight vector until the error is minimized. Building predictive models from the available knowledge base is known as the learning or training process, and the trained model is known as a hypothesis. The ensemble approach aims to combine multiple hypotheses, built from the same weak learner, into a better hypothesis that provides a stable output in terms of accuracy and the least error. The ensemble approach can be made more flexible by adopting model optimization techniques, but this increases the complexity of the model and requires high computational resources to process the data. A big data environment makes existing algorithms scalable so that they can support the analysis of a high volume of data. According to our study, the processing bottleneck is largely avoided in distributed processing. In the Map-Reduce model, the program is sent to the data for processing. A bottleneck is unlikely in the large-scale analytic process because the program's memory size is negligible compared to the size of the data.
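The reweighting loop described above for the ensemble method (train a weak learner, measure its error, multiply the sample weights, repeat) can be sketched with an AdaBoost-style procedure over decision stumps. AdaBoost is used here as a well-known instance of that idea and is an assumption, since the abstract does not name the specific algorithm; the single feature, the labels (+1 for attack, -1 for normal), and the data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) of the best weighted stump."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[f] > t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def adaboost(X, y, rounds=3):
    """Train `rounds` stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start from a uniform data distribution
    ensemble = []
    for _ in range(rounds):
        err, f, t, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak learner
        ensemble.append((alpha, f, t, polarity))
        # Multiply weights up for misclassified samples, down for correct ones.
        preds = [polarity if row[f] > t else -polarity for row in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]  # renormalize to a distribution
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * (pol if row[f] > t else -pol) for a, f, t, pol in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical one-feature data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
single = adaboost(X, y, rounds=1)
combined = adaboost(X, y, rounds=3)
single_preds = [predict(single, row) for row in X]
combined_preds = [predict(combined, row) for row in X]
```

On this toy data a single stump always misclassifies some samples, whereas the weighted vote of three stumps classifies every sample correctly, which illustrates how combining hypotheses from the same weak learner can yield a better hypothesis.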
The significant advantage is that the program comes to the data for processing. In conventional data processing, the program receives the data as an argument, which leads to a bottleneck and requires high computational resources during execution. Big data processing is quite different: the program is far smaller than the data. Map-Reduce, Spark, and HDFS (Hadoop Distributed File System) provide reliability, flexibility, high performance, and efficiency in storing, managing, monitoring, processing, and visualizing the data. An efficient IDS approach is implemented to deal with imbalanced data, using supervised and unsupervised techniques on a scalable platform to detect minority attack classes. To handle more sophisticated attacks and to address the multi-class problem of identifying a particular type of threat, a deep neural network (DNN) based detection approach is implemented. Deep learning models require a large amount of training data during the learning process. The intrusion dataset and our prepared dataset are used to train, validate, and test the models. Empirically, we have observed that the DNN models are able to detect the broad attack classes more efficiently.
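The Map-Reduce idea mentioned in the abstract, where a small program is applied to each data partition and only compact partial results are combined, can be sketched in plain Python as follows. This runs on a single machine purely for illustration; a real deployment would rely on Hadoop or Spark. The partitions, feature values, and label names are hypothetical, and counting records per class is chosen only because it also exposes the class imbalance the abstract refers to.

```python
from collections import Counter
from functools import reduce


def map_partition(records):
    """Map phase: runs against one partition and emits a small per-label count."""
    return Counter(label for _features, label in records)


def merge_counts(left, right):
    """Reduce phase: combine two partial counts into one."""
    return left + right


def count_labels(partitions):
    """Apply the map function to every partition, then reduce the results."""
    return reduce(merge_counts, map(map_partition, partitions), Counter())


# Hypothetical partitions of labeled records: (features, label).
partitions = [
    [((0.1,), "normal"), ((0.9,), "dos")],
    [((0.2,), "normal"), ((0.8,), "probe"), ((0.3,), "normal")],
]
totals = count_labels(partitions)
minority_classes = sorted(label for label, c in totals.items() if c == min(totals.values()))
```

Only the small `Counter` objects move between the map and reduce phases, not the records themselves, which is why the size of the program and its intermediate results stays negligible compared to the data.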

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	Intrusion Detection System; NITRIDS Intrusion Dataset; Multistage IDS; Balancing dataset; DNN based IDS; Big Data Analytics; LoF.
Subjects:	Engineering and Technology > Computer and Information Science > Data Mining; Engineering and Technology > Computer and Information Science > Image Processing
Divisions:	Engineering and Technology > Department of Computer Science Engineering
ID Code:	10307
Deposited By:	IR Staff BPCL
Deposited On:	05 Dec 2022 17:05
Last Modified:	05 Dec 2022 17:05
Supervisor(s):	Jena, Sanjay Kumar and Mohapatra, Durga Prasad
