Machine Learning Techniques For Detecting Untrusted Pages on the Web

Singh, Vikash Kumar (2009) Machine Learning Techniques For Detecting Untrusted Pages on the Web. BTech thesis.


Abstract

The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. This platform is, to some extent, mediated by search engines, which help users find the information they seek. Search engines are the “dragons” that guard a valuable treasure: information.

Many web pages are unscrupulous and try to fool search engines into ranking them at the top of the results. The goal of this project is to detect such spam pages. We consider two kinds in particular: content spam, and link spam, where untrusted pages use link structure to inflate their importance. We pose this as a machine learning problem and build a classifier that assigns pages to one of two categories: trustworthy and untrusted. The classifier takes as input link features (structural characteristics of the web graph) and content-based features.
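To make the two-class formulation concrete, here is a minimal sketch of a content-based classifier. It assumes a bag-of-words representation and a multinomial naive Bayes model with add-one smoothing; the abstract names a naive Bayesian classifier for content spam but gives no implementation details, so the function names and the tiny training set below are hypothetical and purely illustrative.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Count class and word frequencies from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training pages, for illustration only.
training = [
    ("cheap pills buy now cheap".split(), "untrusted"),
    ("win money free casino free".split(), "untrusted"),
    ("university research lecture notes".split(), "trustworthy"),
    ("library catalogue opening hours".split(), "trustworthy"),
]
model = train_naive_bayes(training)
spam_guess = classify(model, "free cheap casino".split())
ham_guess = classify(model, "research lecture hours".split())
```

In a fuller pipeline the link features mentioned above would be added as further inputs; this sketch covers only the content side.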

We propose link-based and content-based techniques for automating the detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. We propose a Naïve Bayesian classifier to detect content spam, and PageRank and TrustRank to detect link spam.
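The link-based side can be illustrated with a short sketch. It treats TrustRank as PageRank with the teleport distribution concentrated on a set of trusted seed pages, so that trust flows only along links reachable from the seeds. The damping factor of 0.85, the fixed iteration count, the toy graph, and the choice of seed are all assumptions made here for illustration; how seeds are selected is not covered, and none of this is taken from the thesis itself.

```python
def biased_pagerank(links, teleport, damping=0.85, iterations=50):
    """Power iteration for PageRank with an arbitrary teleport distribution.

    links maps every page to the list of pages it links to; teleport maps
    every page to its teleport probability (the values should sum to 1).
    """
    nodes = list(links)
    scores = dict(teleport)
    for _ in range(iterations):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            out = links[src]
            if out:
                share = damping * scores[src] / len(out)
                for dst in out:
                    new[dst] += share
            else:
                # Dangling page: spread its score according to teleport.
                for n in nodes:
                    new[n] += damping * scores[src] * teleport[n]
        scores = new
    return scores


def pagerank(links, **kwargs):
    """Standard PageRank: uniform teleport over all pages."""
    uniform = {p: 1 / len(links) for p in links}
    return biased_pagerank(links, uniform, **kwargs)


def trustrank(links, trusted_seeds, **kwargs):
    """TrustRank-style scores: teleport only to the trusted seed pages."""
    teleport = {
        p: (1 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in links
    }
    return biased_pagerank(links, teleport, **kwargs)


# Hypothetical graph: three reputable pages plus a small link farm
# (s1, s2, s3) that exists only to boost the page "spam".
links = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
    "s1": ["spam"],
    "s2": ["spam"],
    "s3": ["spam"],
    "spam": ["s1", "s2", "s3"],
}
pr = pagerank(links)
trust = trustrank(links, trusted_seeds={"a"})
```

In this toy graph the link farm gives "spam" a higher PageRank than the reputable page "a", while its TrustRank-style score stays at zero because no path leads to it from the trusted seed. A large gap between the two scores is the kind of signal a link-spam detector could use.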

Item Type: Thesis (BTech)
Uncontrolled Keywords: Spam Pages, Web
Subjects: Engineering and Technology > Computer and Information Science > Data Mining
Divisions: Engineering and Technology > Department of Computer Science
ID Code: 1337
Deposited By: Vikash Kumar Singh
Deposited On: 19 May 2009 08:48
Last Modified: 19 May 2009 08:48
Supervisor(s): Jena, S K
