
dc.contributor.advisor: Terashima Marín, Hugo
dc.creator: Martínez Eguiarte, Samuel Jesús
dc.creator: 895622
dc.date.accessioned: 2020-03-14T01:27:15Z
dc.date.available: 2020-03-14T01:27:15Z
dc.date.created: 2019-10
dc.identifier.citation: Martínez, S. (2019). Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios. [es_MX]
dc.identifier.uri: http://hdl.handle.net/11285/636281
dc.description.abstract: This thesis describes an investigation that aims to solve the web classification problem in real scenarios. The main motivation of the project is to fully automate the testing of different types of web pages. To that end, four classes were defined: Login, Search, Form, and Article. These classes represent basic testing scenarios, and the main goal is to correctly assign each example to one of them. A dataset of 2,000 examples (500 per class) was created with a purpose-built web-scraping tool that extracts both the tag frequencies and the plain text of a web page. In the tag dataset, each feature is the number of times a given tag appears in a specific web page. The problem can also be viewed as a text classification problem, in which every tag represents a word and every web page represents a document, so one goal of the project is to determine whether the tag dataset can be used instead of a text dataset. For this reason, a second dataset was created from the same 2,000 examples by extracting the plain text of the web pages in order to apply text classification techniques. The classification models selected were Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Multinomial Naive Bayes (MNB), and a simple Neural Network (NN). These models were also defined as One-class models to see whether they could be trained using positive data only. Three experiments were performed: one using the One-class models with the tag dataset; another using the Multi-class models with the same tag dataset; and a last one using the text dataset with both One-class and Multi-class models.
A baseline was created in order to compare the results of the models. This baseline uses an algorithm called Web Page Classification Algorithm Based on Feature Selection (WCAFS), and it yielded a score of 0.88. The One-class models did not reach a score similar to the baseline, with one exception: the One-class Article SVM model, trained on the tag dataset, had a score of 0.88. It was the only model using the tag dataset that matched the baseline. The Multi-class models matched the baseline only when the text dataset was used. The best configuration is therefore a Multi-class model with a TF-IDF (Term Frequency - Inverse Document Frequency) transformation applied to a text dataset. [es_MX]
dc.format.medium: Text [es_MX]
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey [esp]
dc.relation.isFormatOf: published version [es_MX]
dc.rights: Open Access [es_MX]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [*]
dc.subject.lcsh: Science [es_MX]
dc.title: Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios [es_MX]
dc.type: Master's Degree Work [es_MX]
dc.contributor.mentor: Rosales Pérez, Alejandro
dc.publisher.institution: Instituto Tecnológico y de Estudios Superiores de Monterrey [es_MX]
dc.subject.keyword: Web Mining [es_MX]
dc.subject.keyword: Web Classification [es_MX]
dc.subject.keyword: Web pages [es_MX]
dc.subject.keyword: Data Mining [es_MX]
dc.subject.keyword: Text Classification [es_MX]
dc.contributor.institution: Escuela de Ingeniería y Ciencias [es_MX]
dc.contributor.institution: Campus Monterrey [es_MX]
dc.description.degree: Master in Computer Science [es_MX]
dc.audience.educationlevel: Researchers [es_MX]
dc.relation.impreso: 2019-11-13
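
The abstract in this record describes two feature sets built from each web page: the frequency of each HTML tag and the page's plain text. The following is a minimal sketch of that idea using only Python's standard library. It is not the web-scraping tool from the thesis; the names `TagCounter` and `extract_features` and the sample HTML are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags and collects visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_features(html):
    """Return (tag frequency dict, plain text) for one HTML document."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.tag_counts), " ".join(parser.text_parts)


# Assumed sample page resembling the "Login" class.
sample = (
    "<html><body><form>"
    '<input type="text" name="user">'
    '<input type="password" name="pw">'
    "<button>Log in</button>"
    "</form></body></html>"
)
tags, text = extract_features(sample)
print(tags)
print(text)
```

The tag dictionary corresponds to one row of a tag dataset, and the text string to one document of a text dataset; a TF-IDF transformation and a classifier would be applied on top of features like these.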



Except where otherwise noted, this item's license is described as Open Access