Show simple item record

dc.contributor.advisor  Conant Pablos, Santiago Enrique
dc.contributor.author  González Martínez, Fernando
dc.creator  CONANT PABLOS, SANTIAGO ENRIQUE; 56551
dc.date.accessioned  2022-11-10T22:59:05Z
dc.date.available  2022-11-10T22:59:05Z
dc.date.issued  2019-10-10
dc.identifier.citation  González Martínez, F. (2022). A comparative study of deep learning-based image captioning models for violence description (Tesis de maestría). Instituto Tecnológico y de Estudios Superiores de Monterrey. Recuperado de: https://hdl.handle.net/11285/649872  es_MX
dc.identifier.uri  https://hdl.handle.net/11285/649872
dc.description  https://orcid.org/0000-0001-6270-3164  es_MX
dc.description.abstract  The safety and security of people will always rank among the top priorities for governments, states, enterprises, and families. One of the greatest advances in security technology was the invention of the surveillance camera, which gives public and private owners the ability to review recorded events and protect their property, providing undeniable proof of what occurred while they were not present. It is safe to say that most corporations and some homes have some type of security technology, from the simplest surveillance system to more complex technologies such as facial and fingerprint recognition. These systems share a drawback: the volume of data each of them generates. Surveillance cameras alone produce thousands of hours of recordings, stored for later review of any past event. The problem arises when the volume of data surpasses the human capacity to analyze it, and even when humans do analyze it, the quantity and nature of the data can overwhelm them and cause them to miss an event that should not be missed. In this work, the events of interest involve violence and suspicious behavior, such as robberies, assaults, street riots, and fights, among others. Hence the need for a system that can recognize such events as they happen and generate a brief description that the humans using the system can interpret quickly.

Image captioning and video captioning have been active fields of computer science for the past decade. Image captioning works by converting an image and words into features using deep learning models, combining them, and predicting what the model believes the output should be for a given state. Over that time, image captioning models have changed considerably. The basic model uses convolutional neural networks for image analysis and recurrent neural networks for sentence analysis and generation. The addition of attention further improved results by teaching models where to focus when analyzing images and sentences. Most recently, the Transformer has come to dominate most tasks in the field thanks to its ability to perform most of its computations in parallel, making it faster than earlier models. These improvements are reflected in the works that top the leaderboards for image recognition, text generation, and captioning.

The purpose of this work is to create and train models that generate descriptions of normal and violent images. The models proposed are an Encoder-Decoder, an Encoder-Decoder with attention layers, and a Transformer. The base dataset is Flickr8k, a collection of around 8,000 images with five descriptions each, obtained through human annotation. For this work, we extended the dataset with violent images and their descriptions. The images were retrieved using Microsoft's Bing API, and the descriptions were obtained by asking a group of three people to describe each image, mentioning subjects, objects, actions, and places as well as they could. The models were evaluated using BLEU-N, METEOR, CIDEr, and ROUGE-L, machine translation metrics that compare generated sentences against reference sentences to produce an objective score.
Results show that the models can generate sentences that describe both normal and violent images, with the Soft-Attention model obtaining the best performance on both. Models like these could help analyze images found on the web, providing a brief description before a user opens content that may be violent. The results also serve as a base for further improving these models and for building models that can analyze violent videos. This could lead to a system capable of analyzing images and videos in the background and generating brief descriptions of the events found in them, potentially improving security reaction times and crime prevention.  es_MX
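The record carries no code, but the abstract names a baseline Encoder-Decoder architecture. Purely as an illustration, here is a minimal PyTorch sketch of that idea: a pretrained CNN encodes the image into a feature vector and an LSTM generates the caption from it. All class names, sizes, and the choice of ResNet-50 are assumptions for the sketch, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained ResNet-50."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.trunk = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                  # images: (B, 3, 224, 224)
        with torch.no_grad():                   # keep the pretrained backbone frozen
            feats = self.trunk(images)          # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))        # (B, embed_size)

class DecoderRNN(nn.Module):
    """Generate a caption token-by-token with an LSTM conditioned on the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):      # captions: (B, T) token ids
        # Teacher forcing: the image feature acts as the first input step,
        # followed by the embedded ground-truth tokens shifted right by one.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                 # (B, T, vocab_size) logits
```

The attention variant the abstract credits with the best results would replace the single pooled feature with a grid of spatial features that the decoder re-weights at every step; the Transformer drops the recurrence entirely in favor of parallel self-attention.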
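The abstract also mentions that the violent images were retrieved with Microsoft's Bing API. The thesis's retrieval script is not part of this record; the following sketch only shows the general shape of such a query against the public Bing Image Search v7 endpoint. The query term, count, and `YOUR_KEY` are placeholders, not values from the thesis.

```python
import requests

# Hypothetical sketch: fetch candidate images from Bing Image Search v7.
endpoint = "https://api.bing.microsoft.com/v7.0/images/search"
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY"}   # placeholder subscription key
params = {"q": "street fight", "count": 10}           # illustrative query, not the thesis's

response = requests.get(endpoint, headers=headers, params=params, timeout=30)
response.raise_for_status()
for i, item in enumerate(response.json().get("value", [])):
    img = requests.get(item["contentUrl"], timeout=30)  # direct URL of each result
    if img.ok:
        with open(f"candidate_{i:04d}.jpg", "wb") as f:
            f.write(img.content)
```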
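Finally, the metrics named in the abstract (BLEU-N, METEOR, CIDEr, ROUGE-L) all score a generated caption against multiple references. As a small self-contained illustration, BLEU-1 through BLEU-4 can be computed with NLTK; the captions below are invented, and NLTK stands in for whatever toolkit the thesis actually used.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Invented example: five human references (Flickr8k provides five per image)
# against one generated caption, tokenized by whitespace.
references = [[r.split() for r in [
    "a man in a red shirt runs down the street",
    "a man jogging on a city street",
    "a runner in red passes parked cars",
    "a man runs past tall buildings",
    "a jogger in a red top on the road",
]]]
candidates = ["a man in red runs down a street".split()]

smooth = SmoothingFunction().method1            # avoid zero scores on short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, candidates,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

METEOR, CIDEr, and ROUGE-L follow the same candidate-versus-references pattern but emphasize different aspects: synonym and stem matching, consensus TF-IDF weighting, and longest common subsequence, respectively.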
dc.format.medium  Texto  es_MX
dc.language.iso  eng  es_MX
dc.publisher  Instituto Tecnológico y de Estudios Superiores de Monterrey  es_MX
dc.relation.isFormatOf  acceptedVersion  es_MX
dc.relation.isreferencedby  REPOSITORIO NACIONAL CONACYT
dc.rights  openAccess  es_MX
dc.rights.uri  http://creativecommons.org/licenses/by/4.0  es_MX
dc.subject.classification  INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LA INSTRUMENTACIÓN::INSTRUMENTOS ÓPTICOS  es_MX
dc.subject.lcsh  Technology  es_MX
dc.title  A comparative study of deep learning-based image captioning models for violence description  es_MX
dc.type  Tesis de Maestría / Master Thesis  es_MX
dc.contributor.department  Escuela de Ingeniería y Ciencias  es_MX
dc.contributor.committeemember  Terashima Marín, Hugo
dc.contributor.committeemember  González Mendoza, Miguel
dc.contributor.committeemember  González Franco, Nimrod
dc.identifier.orcid  https://orcid.org/0000-0002-2510-0767  es_MX
dc.subject.keyword  Image Captioning  es_MX
dc.subject.keyword  Deep Learning  es_MX
dc.subject.keyword  Transformer  es_MX
dc.subject.keyword  Attention  es_MX
dc.contributor.institution  Campus Monterrey  es_MX
dc.contributor.cataloger  emijzarate  es_MX
dc.description.degree  Master of Science in Computer Science  es_MX
dc.date.accepted  2022-06-06
dc.audience.educationlevel  Estudiantes/Students  es_MX
dc.audience.educationlevel  Investigadores/Researchers  es_MX
dc.audience.educationlevel  Maestros/Teachers  es_MX
dc.audience.educationlevel  Público en general/General public  es_MX
dc.identificator  7||33||3311||331111  es_MX



This item appears in the following Collection(s)

  • Ciencias Sociales
    Gobierno y Transformación Pública / Humanidades y Educación / Negocios / Arquitectura y Diseño / EGADE Business School


Except where otherwise noted, this item's license is described as openAccess