Show simple item record

dc.contributor.advisor  Conant Pablos, Santiago Enrique
dc.contributor.author  González Martínez, Fernando
dc.creator  CONANT PABLOS, SANTIAGO ENRIQUE; 56551
dc.date.accessioned  2022-11-10T22:59:05Z
dc.date.available  2022-11-10T22:59:05Z
dc.date.issued  2019-10-10
dc.identifier.citation  González Martínez, F. (2022). A comparative study of deep learning-based image captioning models for violence description (Tesis de maestría). Instituto Tecnológico y de Estudios Superiores de Monterrey. Recuperado de: https://hdl.handle.net/11285/649872  es_MX
dc.identifier.uri  https://hdl.handle.net/11285/649872
dc.description  https://orcid.org/0000-0001-6270-3164  es_MX
dc.description.abstract  The safety and security of people will always rank among the top priorities for governments, states, enterprises, and families. One of the greatest advances in security technology was the invention of the surveillance camera, which gives public and private owners the ability to review recorded events and protect their property, providing undeniable proof of what occurred while they were not present. It is safe to say that most corporations and some homes have some type of security technology, from the simplest surveillance system to more complex technologies such as facial and fingerprint recognition. These systems share a drawback: the volume of data each of them generates. Surveillance cameras alone produce thousands of hours of recordings, stored for later review of any past event. The problem arises when the volume of data surpasses the human capacity to analyze it, and even when humans do analyze it, the quantity and nature of the data can overwhelm them and cause them to miss an event that should not be missed. In this work, the events of interest involve violence and suspicious behavior, such as robberies, assaults, street riots, and fights, among others. Hence the need for a system that can recognize such events as they happen and generate a brief description that the humans using the system can interpret quickly.

Image captioning and video captioning have been active fields of computer science for the past decade. Image captioning works by converting an image and words into features using deep learning models, combining them, and predicting what the model believes the output should be for a given state. Over that time, image captioning models have changed considerably. The basic model uses convolutional neural networks for image analysis and recurrent neural networks for sentence analysis and generation. The addition of attention further improved results by teaching models where to focus when analyzing images and sentences. Most recently, the Transformer has come to dominate most tasks in the field thanks to its ability to perform most of its computations in parallel, making it faster than earlier models. These improvements are reflected in the works that top the leaderboards for image recognition, text generation, and captioning.

The purpose of this work is to create and train models that generate descriptions of normal and violent images. The models proposed are an Encoder-Decoder, an Encoder-Decoder with attention layers, and a Transformer. The base dataset is Flickr8k, a collection of around 8,000 images with five descriptions each, obtained through human annotation. For this work, we extended the dataset with violent images and their descriptions. The images were retrieved using Microsoft's Bing API, and the descriptions were obtained by asking a group of three people to describe each image, mentioning subjects, objects, actions, and places as well as they could. The models were evaluated using BLEU-N, METEOR, CIDEr, and ROUGE-L, machine translation metrics that compare generated sentences against reference sentences to produce an objective score.
Results show that the models can generate sentences that describe both normal and violent images, with the Soft-Attention model obtaining the best performance on both. Models like these could help analyze images found on the web, providing a brief description before a user opens content that may be violent. The results also serve as a base for further improving these models and for building models that can analyze violent videos. This could lead to a system capable of analyzing images and videos in the background and generating brief descriptions of the events found in them, potentially improving security reaction times and crime prevention.  es_MX
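The record carries no code, but the abstract names a baseline Encoder-Decoder architecture. Purely as an illustration, here is a minimal PyTorch sketch of that idea: a pretrained CNN encodes the image into a feature vector and an LSTM generates the caption from it. All class names, sizes, and the choice of ResNet-50 are assumptions for the sketch, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained ResNet-50."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.trunk = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                  # images: (B, 3, 224, 224)
        with torch.no_grad():                   # keep the pretrained backbone frozen
            feats = self.trunk(images)          # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))        # (B, embed_size)

class DecoderRNN(nn.Module):
    """Generate a caption token-by-token with an LSTM conditioned on the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):      # captions: (B, T) token ids
        # Teacher forcing: the image feature acts as the first input step,
        # followed by the embedded ground-truth tokens shifted right by one.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                 # (B, T, vocab_size) logits
```

The attention variant the abstract credits with the best results would replace the single pooled feature with a grid of spatial features that the decoder re-weights at every step; the Transformer drops the recurrence entirely in favor of parallel self-attention.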
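The abstract also mentions that the violent images were retrieved with Microsoft's Bing API. The thesis's retrieval script is not part of this record; the following sketch only shows the general shape of such a query against the public Bing Image Search v7 endpoint. The query term, count, and `YOUR_KEY` are placeholders, not values from the thesis.

```python
import requests

# Hypothetical sketch: fetch candidate images from Bing Image Search v7.
endpoint = "https://api.bing.microsoft.com/v7.0/images/search"
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY"}   # placeholder subscription key
params = {"q": "street fight", "count": 10}           # illustrative query, not the thesis's

response = requests.get(endpoint, headers=headers, params=params, timeout=30)
response.raise_for_status()
for i, item in enumerate(response.json().get("value", [])):
    img = requests.get(item["contentUrl"], timeout=30)  # direct URL of each result
    if img.ok:
        with open(f"candidate_{i:04d}.jpg", "wb") as f:
            f.write(img.content)
```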
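Finally, the metrics named in the abstract (BLEU-N, METEOR, CIDEr, ROUGE-L) all score a generated caption against multiple references. As a small self-contained illustration, BLEU-1 through BLEU-4 can be computed with NLTK; the captions below are invented, and NLTK stands in for whatever toolkit the thesis actually used.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Invented example: five human references (Flickr8k provides five per image)
# against one generated caption, tokenized by whitespace.
references = [[r.split() for r in [
    "a man in a red shirt runs down the street",
    "a man jogging on a city street",
    "a runner in red passes parked cars",
    "a man runs past tall buildings",
    "a jogger in a red top on the road",
]]]
candidates = ["a man in red runs down a street".split()]

smooth = SmoothingFunction().method1            # avoid zero scores on short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, candidates,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

METEOR, CIDEr, and ROUGE-L follow the same candidate-versus-references pattern but emphasize different aspects: synonym and stem matching, consensus TF-IDF weighting, and longest common subsequence, respectively.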
dc.format.medium  Texto  es_MX
dc.language.iso  eng  es_MX
dc.publisher  Instituto Tecnológico y de Estudios Superiores de Monterrey  es_MX
dc.relation.isFormatOf  acceptedVersion  es_MX
dc.relation.isreferencedby  REPOSITORIO NACIONAL CONACYT
dc.rights  openAccess  es_MX
dc.rights.uri  http://creativecommons.org/licenses/by/4.0  es_MX
dc.subject.classification  INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LA INSTRUMENTACIÓN::INSTRUMENTOS ÓPTICOS  es_MX
dc.subject.lcsh  Technology  es_MX
dc.title  A comparative study of deep learning-based image captioning models for violence description  es_MX
dc.type  Tesis de Maestría / Master Thesis  es_MX
dc.contributor.department  Escuela de Ingeniería y Ciencias  es_MX
dc.contributor.committeemember  Terashima Marín, Hugo
dc.contributor.committeemember  González Mendoza, Miguel
dc.contributor.committeemember  González Franco, Nimrod
dc.identifier.orcid  https://orcid.org/0000-0002-2510-0767  es_MX
dc.subject.keyword  Image Captioning  es_MX
dc.subject.keyword  Deep Learning  es_MX
dc.subject.keyword  Transformer  es_MX
dc.subject.keyword  Attention  es_MX
dc.contributor.institution  Campus Monterrey  es_MX
dc.contributor.cataloger  emijzarate  es_MX
dc.description.degree  Master of Science in Computer Science  es_MX
dc.date.accepted  2022-06-06
dc.audience.educationlevel  Estudiantes/Students  es_MX
dc.audience.educationlevel  Investigadores/Researchers  es_MX
dc.audience.educationlevel  Maestros/Teachers  es_MX
dc.audience.educationlevel  Público en general/General public  es_MX
dc.identificator  7||33||3311||331111  es_MX



This item appears in the following Collection(s)

  • Ciencias Sociales
    Gobierno y Transformación Pública / Humanidades y Educación / Negocios / Arquitectura y Diseño / EGADE Business School


Except where otherwise noted, this item's license is described as openAccess