Show simple item record

dc.contributor.advisor: Dr. Ramón F. Brena
dc.creator: Ramírez Rangel, Eduardo H.
dc.date.accessioned: 2015-08-17T11:35:09Z
dc.date.available: 2015-08-17T11:35:09Z
dc.date.issued: 2010-01-12
dc.identifier.uri: http://hdl.handle.net/11285/572556
dc.description.abstract: Creating topic models of text collections is an important step towards more adaptive information access and retrieval applications. Such models encode knowledge of the topics discussed in a collection, the documents that belong to each topic, and the semantic similarity of a given pair of topics. Among other things, they can be used to focus or disambiguate search queries and to construct visualizations for navigating the collection. So far, the dominant paradigm in topic modeling has been the probabilistic topic modeling approach, in which topics are represented as probability distributions over terms and documents are assumed to be generated from a mixture of random topics. Although such models are theoretically sound, their high computational complexity makes them difficult to use on very large collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as freely overlapping clusters of semantically similar documents, which can take advantage of highly scalable clustering algorithms. We then propose the Query-based Topic Modeling framework (QTM), an information-theoretic method that assumes the existence of a "golden" set of queries that can capture most of the semantic information of the collection and produce models with maximum semantic coherence. The QTM method uses information-theoretic heuristics to find a set of "topical queries", which are then co-clustered along with the documents of the collection and transformed to produce overlapping document clusters. The QTM framework was designed with scalability in mind and can be executed in parallel on commodity-class machines using the Map-Reduce approach. Then, in order to compare QTM results with models generated by other methods, we developed metrics that formalize the notion of semantic coherence using probabilistic concepts and the familiar notions of recall and precision.
In contrast to traditional clustering metrics, the proposed metrics are generalized to validate overlapping and potentially incomplete clustering solutions using multi-labeled corpora. We use them to experimentally validate our query-based approach, showing that models produced using selected queries outperform those produced using the collection vocabulary. We also explore the heuristics and settings that determine the performance of QTM, and show that the proposed method can produce models of comparable, or even superior, quality to those produced with state-of-the-art probabilistic methods.
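The abstract mentions precision- and recall-style metrics generalized to overlapping, multi-labeled clusterings, but the record does not reproduce them. Purely as an illustration of what such a generalization can look like, the sketch below implements the extended BCubed precision/recall metrics, a standard choice for overlapping clusterings; it is not the thesis's actual formulation, and the `clusters`/`labels` inputs are hypothetical names chosen for this example:

```python
def extended_bcubed(clusters, labels):
    """Extended BCubed precision/recall for overlapping clusterings.

    clusters: dict mapping each item to the set of cluster ids it belongs to
    labels:   dict mapping each item to its set of gold-standard labels
    Returns (precision, recall), each averaged first per item, then overall.
    """
    def avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    prec_per_item, rec_per_item = [], []
    for e in clusters:
        p, r = [], []
        for e2 in clusters:
            shared_c = len(clusters[e] & clusters[e2])  # clusters in common
            shared_l = len(labels[e] & labels[e2])      # gold labels in common
            if shared_c:  # the pair co-occurs in at least one cluster
                p.append(min(shared_c, shared_l) / shared_c)
            if shared_l:  # the pair shares at least one gold label
                r.append(min(shared_c, shared_l) / shared_l)
        prec_per_item.append(avg(p))
        rec_per_item.append(avg(r))
    return avg(prec_per_item), avg(rec_per_item)


# A solution that exactly mirrors the gold labels scores 1.0 on both metrics.
clusters = {"d1": {0}, "d2": {0}, "d3": {1}}
labels = {"d1": {"sports"}, "d2": {"sports"}, "d3": {"politics"}}
print(extended_bcubed(clusters, labels))  # -> (1.0, 1.0)
```

Because both scores are pairwise and use the min of shared clusters and shared labels, they remain well defined when an item sits in several clusters or carries several labels, which is exactly the multi-labeled, overlapping setting the abstract describes.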
dc.language: eng
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0*
dc.title: Large Scale Topic Modeling Using Search Queries: An Information-Theoretic Approach
dc.type: Tesis de doctorado (doctoral thesis)
thesis.degree.level: Doctor of Philosophy
dc.contributor.committeemember: Dra. Alma Delia Cuevas
dc.contributor.committeemember: Dr. José Ignacio Leaza
dc.contributor.committeemember: Dr. Leonardo Garrido
dc.contributor.committeemember: Dr. Randy Goebel
thesis.degree.discipline: School of Engineering
thesis.degree.name: Graduate Programs in Mechatronics and Information Technologies
dc.subject.keyword: Queries
dc.subject.keyword: QTM
thesis.degree.program: Campus Monterrey
dc.subject.discipline: Ingeniería y Ciencias Aplicadas / Engineering & Applied Sciences
refterms.dateFOA: 2018-03-07T07:37:32Z


