Detection of Deceptive Online Reviews using Machine Learning Techniques

Rout, Jitendra Kumar (2018) Detection of Deceptive Online Reviews using Machine Learning Techniques. PhD thesis.

PDF (Full text is restricted up to 29.09.2020)
Restricted to Repository staff only
1089Kb

Abstract

Individuals and organizational decision makers are heavily influenced by online opinion forums and review websites when accepting or rejecting a particular item or product. Review websites have become one of the key platforms where consumers compare products and services and share their views and experiences. Individuals, manufacturers and retailers increasingly consult these customer reviews before making business decisions. Because the reviews received are not scrutinized, some of them turn out to be review spam. Moreover, driven by the desire for advantage, publicity or both, spammers produce synthesized reviews to promote or demote certain products or brands. Opinion spamming can hurt consumers and damage businesses, and it has the potential to create social and political chaos. Hence, such reviews need to be detected so that social media and review sites remain a trusted source of public opinion, free of fake or deceptive reviews. As opinions in social media and forums are increasingly used in practice, opinion spamming is becoming more and more sophisticated, which presents a major challenge for its detection. Even though the detection of opinion spam and fake reviews has attracted significant research attention in recent years, the problem remains large and is still being actively studied by researchers and practitioners. To mitigate fake reviews, the labeled and unlabeled data generated on a daily basis are taken as the basis for analysis. With this in mind, this study carries out an in-depth analysis and designs a methodology for detecting opinion spam. Based on the degree of availability of labeled data, three major types of Machine Learning (ML) approaches have been considered: Supervised Learning, Semi-supervised Learning and Unsupervised Learning.
Our investigation starts with supervised learning techniques to identify review spam based on labeled data. A unified model has been proposed to separate spam reviews from genuine ones using the only publicly available standard dataset. The most effective feature sets have been assembled for model building, and sentiment analysis has been incorporated in the detection process. To obtain the best performance, some well-known classifiers have been applied to the labeled dataset. It is observed that, though n-gram features and linguistic features have been adopted by a few researchers earlier, none of them has proposed a unified model that works universally for all types of data. The proposed model obtained an accuracy of 92.12%, which is better than existing models, and it also considers the features mentioned individually.
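As a rough illustration of the supervised setting described above, the following minimal sketch trains a multinomial Naive Bayes classifier on word unigram and bigram counts. It is not the unified model of the thesis: the choice of classifier, the helper names (`ngram_features`, `NaiveBayes`) and the four toy reviews are all assumptions made for illustration, and no sentiment or linguistic features are included.

```python
import math
from collections import Counter, defaultdict

def ngram_features(text, n_max=2):
    """Return a Counter of word n-grams, from unigrams up to n_max-grams."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over n-gram counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngram_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = ngram_features(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each observed n-gram.
            score = math.log(doc_count / total_docs)
            for feat, freq in feats.items():
                score += freq * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy, made-up reviews (NOT taken from the dataset used in the thesis).
train_texts = [
    "best hotel ever amazing amazing stay must visit",
    "absolutely perfect best experience ever must book",
    "room was clean but the elevator was slow",
    "breakfast was fine and checkout took ten minutes",
]
train_labels = ["deceptive", "deceptive", "truthful", "truthful"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("amazing best stay ever")
print(prediction)
```

On real data the n-gram counts would typically be combined with other feature groups and evaluated with cross-validation; the sketch only shows how labeled reviews become features and a decision.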
One of the major challenges in this area of research is the unavailability of enough standard datasets. Owing to the difficulty of manually labeling training examples, unsupervised learning techniques have been explored to identify spam using unlabeled data. Amazon’s unlabeled Cell Phone review dataset is used for this purpose. Clustering is applied after the desired attributes have been computed for spam detection. Experimental evaluation resulted in 7.684% of reviews being identified as outliers and hence classified as spam reviews.
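The idea of clustering reviews and treating outliers as spam can be sketched as follows. This is only one possible reading, under stated assumptions: the abstract does not name the clustering algorithm or the attributes, so plain k-means, the two hypothetical attributes (rating deviation, normalized review length), the distance-based outlier rule and the toy points are all illustrative choices.

```python
import math
import statistics

def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (a toy choice)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

def flag_outliers(points, k=2, n_std=2.0):
    """Flag points lying unusually far from their own cluster centroid."""
    centroids, assignment = kmeans(points, k)
    distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignment)]
    threshold = statistics.mean(distances) + n_std * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]

# Hypothetical per-review attributes: (rating deviation, normalized length).
reviews = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (0.5, 3.0),  # deliberately unusual review
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1), (1.05, 1.05),
]

flagged = flag_outliers(reviews)
share = 100.0 * len(flagged) / len(reviews)
print(flagged, f"{share:.1f}% flagged")
```

The percentage of flagged reviews depends directly on `k` and `n_std`, so a figure such as the one reported above is meaningful only together with the attributes and thresholds that produced it.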
In an attempt to utilize minimal labeled data and to take advantage of the large amount of available unlabeled data, semi-supervised learning methods have been applied, using the labeled data to label the remaining unlabeled data. Five different algorithms have been adapted and applied to detect deceptive reviews. The dataset used for this purpose is more varied, and a larger number of features has been incorporated. The results show that the proposed semi-supervised methods classify the datasets efficiently and perform better than existing works.
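Using a small labeled set to label the remaining data is commonly done with self-training, sketched below. The abstract does not list its five algorithms, so self-training is shown only as a representative scheme and may not be among them; the nearest-centroid base classifier, the `min_margin` confidence rule and the toy feature vectors are assumptions.

```python
import math

def centroids_of(samples):
    """Mean feature vector per label from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_with_margin(x, centroids):
    """Return the nearest label and the gap between the two closest centroids."""
    ranked = sorted((math.dist(x, c), y) for y, c in centroids.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]

def self_train(labeled, unlabeled, min_margin=0.5, max_rounds=10):
    """Repeatedly move confidently predicted samples into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = centroids_of(labeled)
        confident, rest = [], []
        for x in pool:
            label, margin = predict_with_margin(x, centroids)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing left that the model is sure about
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return labeled, pool

# Toy feature vectors (made up); at least two labeled classes are required.
labeled = [((0.0, 0.0), "truthful"), ((1.0, 1.0), "deceptive")]
unlabeled = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (1.1, 0.9), (0.5, 0.5)]

final_labeled, remaining = self_train(labeled, unlabeled)
print(len(final_labeled), remaining)
```

The ambiguous point halfway between the two classes stays unlabeled, which is the intended behavior: self-training only grows the labeled set with predictions that clear the confidence margin.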

Given the high velocity at which reviews are generated daily, handling and processing such huge volumes of data becomes the bottleneck. To overcome this problem, a Big Data approach is used, in which a distributed and scalable cluster such as Hadoop is needed for storing (HDFS) and processing (MapReduce as well as Spark) the datasets efficiently. The experiments performed show that the proposed big data analytics framework effectively demonstrates the need for developing fitting models and for efficiently detecting deceptive reviewers in a big review system data stream.

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	Review Analysis; Online Review Spam; Review Spam Detection Techniques; Supervised Learning; Unlabeled Reviews; Semi-supervised Learning; Big Data Analysis
Subjects:	Engineering and Technology > Computer and Information Science > Wireless Local Area Network; Engineering and Technology > Computer and Information Science > Information Security
Divisions:	Engineering and Technology > Department of Computer Science
ID Code:	9414
Deposited By:	IR Staff BPCL
Deposited On:	28 Sep 2018 11:51
Last Modified:	28 Sep 2018 11:51
Supervisor(s):	Jena, Sanjay Kumar and Rath, Santanu Kumar
