Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios
Martínez Eguiarte, Samuel Jesús
This thesis describes an investigation that aims to solve the web classification problem in real scenarios. The main motivation of the project is to fully automate the testing of different types of web pages. To that end, four classes were defined: Login, Search, Form, and Article. These four classes represent basic testing scenarios, and the main goal is to correctly identify each example as one of them.

A dataset was initially created containing 2,000 examples (500 per class). Its features are the tag elements of a web page: each feature records how frequently a tag appears in a specific page. Since this problem can also be viewed as a text classification problem, where every tag element represents a different word and a web page represents a different document, one goal of the project is to determine whether the tag dataset could be used instead of a text dataset. For this reason, a second dataset was created from the same 2,000 examples, in which the plain text of each web page was extracted so that text classification techniques could be applied. To obtain all 2,000 examples, a web-scraping tool was created that extracts both the tag frequencies and the plain text of a web page.

The classification models selected were Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Multinomial Naive Bayes (MNB), and a simple Neural Network (NN). These models were also defined as One-class models, in order to test whether they could be trained using positive data only. In the end, three experiments were performed: one using the One-class models with the tag dataset, another using the same tag dataset with Multi-class models, and a last one using the text dataset with both One-class and Multi-class models. A baseline was created in order to compare the results of the models.
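The tag-frequency features described above could be extracted with a routine along these lines. This is only a minimal sketch using Python's standard-library HTML parser; the thesis's actual scraping tool is not specified, and the sample page and function names here are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how often each HTML tag opens in a document."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Every opening tag increments that tag's frequency.
        self.counts[tag] += 1

def tag_features(html: str) -> Counter:
    """Map a raw HTML page to its tag-frequency feature vector."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts

# Hypothetical page: two <input> tags inside a <form> suggest a Login/Form page.
page = "<html><body><form><input><input><a>link</a></form></body></html>"
features = tag_features(page)
```

Each class (Login, Search, Form, Article) would then be characterized by which tags dominate its pages, e.g. `form` and `input` for Login pages versus `p` and `article` for Article pages.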
This baseline uses an algorithm called Web Page Classification Algorithm Based on Feature Selection (WCAFS) and yielded a score of 0.88. With one exception, the One-class models did not approach the baseline score: the One-class Article SVM model, using the tag dataset, also scored 0.88, making it the only model trained on the tag dataset to match the baseline. The Multi-class models matched the baseline only when the text dataset was used. Consequently, the best configuration is a Multi-class model with a TF-IDF (Term Frequency - Inverse Document Frequency) transformation applied to a text dataset.
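The TF-IDF weighting behind the best configuration can be sketched as follows. This is a toy illustration with an invented three-document corpus; the abstract does not detail the actual transformation pipeline, and the standard weighting tf(t, d) * log(N / df(t)) is assumed here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute per-document TF-IDF weights for tokenized documents.

    Weight of term t in document d: tf(t, d) * log(N / df(t)),
    where N is the corpus size and df(t) the number of documents containing t.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical word lists standing in for the extracted plain text of pages.
docs = [["login", "password", "submit"],
        ["search", "query", "submit"],
        ["article", "author", "text"]]
w = tf_idf(docs)
# "submit" appears in two of the three documents, so it is down-weighted
# relative to class-discriminative terms like "login" or "article".
```

Down-weighting terms shared across classes while emphasizing rare, discriminative ones is what makes the text dataset, after this transformation, competitive with the WCAFS baseline.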