dc.contributor.advisor	Terashima Marín, Hugo
dc.creator	Martínez Eguiarte, Samuel Jesús
dc.creator	895622
dc.date.accessioned	2020-03-14T01:27:15Z
dc.date.available	2020-03-14T01:27:15Z
dc.date.created	2019-10
dc.identifier.citation	Martínez, S. (2019). Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios.	es_MX
dc.identifier.uri	http://hdl.handle.net/11285/636281
dc.description.abstract	This thesis investigates the web classification problem in real scenarios. The main motivation is to fully automate the testing of different types of web pages. To that end, four classes were defined: Login, Search, Form, and Article. These classes represent basic testing scenarios, and the main goal is to correctly assign each example to one of them. A dataset of 2,000 examples (500 per class) was first created, with the tag elements of each web page as features; each feature records how often a tag appears in that page. The problem can also be viewed as text classification, where every tag element plays the role of a word and every web page the role of a document, so one goal of the project is to determine whether the tag dataset can be used instead of a text dataset. For this reason, a second dataset was created from the same 2,000 examples by extracting the plain text of the web pages so that text classification techniques could be applied. The classification models selected were Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Multinomial Naive Bayes (MNB), and a simple Neural Network (NN). These models were also defined as One-class models to see whether they could be trained using positive data only. Three experiments were performed: one using One-class models with the tag dataset; another using Multi-class models with the same tag dataset; and a last one using the text dataset with both One-class and Multi-class models. To obtain the 2,000 examples, a web-scraping tool was created that extracts both the tag frequencies and the plain text of a web page. 
A baseline was created to compare the results of the models. It uses an algorithm called Web Page Classification Algorithm Based on Feature Selection (WCAFS) and yielded a score of 0.88. The One-class models did not reach a score similar to the baseline, with one exception: the One-class Article SVM model on the tag dataset scored 0.88. It was the only model using the tag dataset that matched the baseline. The Multi-class models matched the baseline only when the text dataset was used. Therefore, the best configuration is a Multi-class model with a TF-IDF (Term Frequency - Inverse Document Frequency) transformation applied to the text dataset.	es_MX
dc.format.medium	Text	es_MX
dc.publisher	Instituto Tecnológico y de Estudios Superiores de Monterrey	esp
dc.relation.isFormatOf	published version	es_MX
dc.rights	Open Access	es_MX
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject.lcsh	Science	es_MX
dc.title	Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios	es_MX
dc.type	Master's Degree Work	es_MX
dc.contributor.mentor	Rosales Pérez, Alejandro
dc.publisher.institution	Instituto Tecnológico y de Estudios Superiores de Monterrey	es_MX
dc.subject.keyword	Web Mining	es_MX
dc.subject.keyword	Web Classification	es_MX
dc.subject.keyword	Web pages	es_MX
dc.subject.keyword	Data Mining	es_MX
dc.subject.keyword	Text Classification	es_MX
dc.contributor.institution	Escuela de Ingeniería y Ciencias	es_MX
dc.contributor.institution	Campus Monterrey	es_MX
dc.description.degree	Maestro en Ciencias Computacionales	es_MX
dc.audience.educationlevel	Researchers	es_MX
dc.relation.impreso	2019-11-13

Using Data Mining Techniques to Solve the Web Classification Problem in Real Scenarios
