A comparative study of deep learning-based image captioning models for violence description
Abstract
The safety and security of people has always been a top priority for governments, states, enterprises, and families. One of the greatest advances in the field of security technology was the invention of the surveillance camera, which gives public and private owners the ability to review recorded events and protect their property, providing undeniable proof of incidents that occurred in their absence.
It is safe to say that most corporations and some homes have some type of security technology, from simple surveillance systems to more sophisticated ones such as facial and fingerprint recognition. These systems share a drawback: the volume of data each of them generates. Surveillance cameras alone produce thousands of hours of footage that are recorded and stored for later review. The problem arises when the volume of data generated surpasses the human capacity to analyze it. Even when humans do analyze it, error becomes a factor, as the quantity and nature of the data can overwhelm an operator and cause an important event to be missed.
In this work, the events of interest involve violence and suspicious behavior, such as robberies, assaults, street riots, and fights. This motivates the need for a system that can recognize such events as they happen and generate a brief description, allowing faster interpretation by the humans operating the system.
The fields of image captioning and video captioning have been active in computer science for the past decade. Image captioning works by converting an image and words into features using deep learning models, combining them, and predicting what the model believes the output should be for a given state. Over that time, image captioning models have gone through several generations. The basic model uses convolutional neural networks for image analysis and recurrent neural networks for sentence analysis and generation. The addition of attention further improved results by teaching models where to focus when analyzing images and sentences. Finally came the Transformer, which now dominates most tasks in the field thanks to its ability to perform most of its computation in parallel, making it faster than earlier models. These performance improvements are reflected in prior work that tops the leaderboards for image recognition, text generation, and captioning.
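To make the basic encoder-decoder design concrete, the following is a minimal PyTorch sketch, not the thesis implementation: a CNN encoder maps an image to a feature vector that conditions an LSTM decoder over the caption vocabulary. The class names, hyperparameters, and ResNet-50 backbone are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN encoder: a ResNet backbone maps an image to a feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=None)  # weights omitted so the sketch is self-contained
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final FC layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)      # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))       # (B, embed_size)

class DecoderRNN(nn.Module):
    """RNN decoder: conditions an LSTM on the image feature to emit word logits."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first "token" of the sequence.
        emb = self.embed(captions)                               # (B, T, E)
        inputs = torch.cat([img_feats.unsqueeze(1), emb], dim=1)
        out, _ = self.lstm(inputs)                               # (B, T+1, H)
        return self.fc(out)                                      # logits over the vocabulary

# Smoke test with toy dimensions (hyperparameters are illustrative, not the thesis settings).
enc, dec = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 12))
print(dec(enc(images), captions).shape)  # torch.Size([2, 13, 5000])

At training time the logits would be compared against the caption shifted by one position; attention variants replace the single feature vector with a grid of spatial features the decoder attends over.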
The purpose of this work is to create and train models that generate descriptions of normal and violent images. The models proposed are an Encoder-Decoder, an Encoder-Decoder with Attention layers, and a Transformer. The base dataset is Flickr8k, a collection of around 8,000 images with five human-written descriptions each. For this work, we extended the dataset with violent images and their descriptions. The descriptions were obtained by asking a group of three people to describe each image shown, mentioning subjects, objects, actions, and places as best they could. The images were retrieved using Microsoft's Bing API, along the lines of the sketch below.
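The following is a hypothetical retrieval script illustrating how such images could be collected. The endpoint, header, and parameters follow the public Bing Image Search v7 API and are assumptions rather than the exact tooling used in this work.

import requests

API_KEY = "YOUR_BING_API_KEY"  # placeholder, not a real key
ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"

def search_images(query, count=50):
    """Return a list of candidate image URLs for a search query."""
    resp = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        params={"q": query, "count": count, "safeSearch": "Off"},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["contentUrl"] for item in resp.json().get("value", [])]

urls = search_images("street fight")  # example query, not from the thesis
print(len(urls), "candidate images")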
The models were then evaluated with BLEU-N, METEOR, CIDEr, and ROUGE-L, machine translation metrics that compare generated sentences against reference sentences to produce an objective score. Results show that the models can generate sentences describing both normal and violent images, with the Soft-Attention model achieving the best performance on both.
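As an illustration of how these metrics work, the sketch below scores a toy generated caption against two toy references using NLTK's BLEU implementation; the captions shown are invented examples, and the thesis evaluation may have used different tooling (e.g., the COCO caption evaluation scripts).

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy reference captions and a toy generated caption (not from the dataset).
references = [
    "two men fight on a city street".split(),
    "two people are fighting in the street".split(),
]
candidate = "two men are fighting in the street".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)    # uniform weights over 1..n-gram precisions
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")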
Given our results, these models can generate descriptions of both violent and normal images. Such models could help analyze images found on the web, providing a brief description before a user opens violent content. The results obtained can serve as a base for further improving these models and for building models that analyze violent videos. This could lead to a system capable of analyzing images and videos in the background and generating a brief description of the events found in them, potentially improving security response times and crime prevention.