A comparative study of deep learning-based image captioning models for violence description
Abstract
The safety and security of people has always been a top priority for governments, states, enterprises, and families. One of the greatest advances in the field of security technology was the invention of the surveillance camera, which gives public and private owners the ability to review recorded events and protect their property, providing undeniable proof of incidents that occurred in their absence.
It is safe to say that most corporations and some homes have some type of security technology, from simple surveillance systems to more sophisticated ones such as facial and fingerprint recognition. These systems share a drawback: the volume of data each of them generates. Surveillance cameras alone produce thousands of hours of footage that are recorded and stored for later review. The problem arises when the volume of data generated surpasses the human capacity to analyze it. Even when humans do analyze it, error becomes a factor, as the quantity and nature of the data can overwhelm an operator and cause an important event to be missed.
In this work, the events of interest involve violence and suspicious behavior, such as robberies, assaults, street riots, and fights. This motivates the need for a system that can recognize such events as they happen and generate a brief description, allowing faster interpretation by the humans operating the system.
The fields of image captioning and video captioning have been active in computer science for the past decade. Image captioning works by converting an image and words into features using deep learning models, combining them, and predicting what the model believes the output should be for a given state. Over that time, image captioning models have gone through several generations. The basic model uses convolutional neural networks for image analysis and recurrent neural networks for sentence analysis and generation. The addition of attention further improved results by teaching models where to focus when analyzing images and sentences. Finally came the Transformer, which now dominates most tasks in the field thanks to its ability to perform most of its computation in parallel, making it faster than earlier models. These performance improvements are reflected in prior work that tops the leaderboards for image recognition, text generation, and captioning.
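To make the basic encoder-decoder design concrete, the following is a minimal PyTorch sketch, not the thesis implementation: a CNN encoder maps an image to a feature vector that conditions an LSTM decoder over the caption vocabulary. The class names, hyperparameters, and ResNet-50 backbone are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN encoder: a ResNet backbone maps an image to a feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=None)  # weights omitted so the sketch is self-contained
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final FC layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)      # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))       # (B, embed_size)

class DecoderRNN(nn.Module):
    """RNN decoder: conditions an LSTM on the image feature to emit word logits."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first "token" of the sequence.
        emb = self.embed(captions)                               # (B, T, E)
        inputs = torch.cat([img_feats.unsqueeze(1), emb], dim=1)
        out, _ = self.lstm(inputs)                               # (B, T+1, H)
        return self.fc(out)                                      # logits over the vocabulary

# Smoke test with toy dimensions (hyperparameters are illustrative, not the thesis settings).
enc, dec = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 12))
print(dec(enc(images), captions).shape)  # torch.Size([2, 13, 5000])

At training time the logits would be compared against the caption shifted by one position; attention variants replace the single feature vector with a grid of spatial features the decoder attends over.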
The purpose of this work is to create and train models that generate descriptions of normal and violent images. The models proposed are an Encoder-Decoder, an Encoder-Decoder with Attention layers, and a Transformer. The base dataset is Flickr8k, a collection of around 8,000 images with five human-written descriptions each. For this work, we extended the dataset with violent images and their descriptions. The descriptions were obtained by asking a group of three people to describe each image shown, mentioning subjects, objects, actions, and places as best they could. The images were retrieved using Microsoft's Bing API, along the lines of the sketch below.
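The following is a hypothetical retrieval script illustrating how such images could be collected. The endpoint, header, and parameters follow the public Bing Image Search v7 API and are assumptions rather than the exact tooling used in this work.

import requests

API_KEY = "YOUR_BING_API_KEY"  # placeholder, not a real key
ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"

def search_images(query, count=50):
    """Return a list of candidate image URLs for a search query."""
    resp = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        params={"q": query, "count": count, "safeSearch": "Off"},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["contentUrl"] for item in resp.json().get("value", [])]

urls = search_images("street fight")  # example query, not from the thesis
print(len(urls), "candidate images")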
The models were then evaluated with BLEU-N, METEOR, CIDEr, and ROUGE-L, machine translation metrics that compare generated sentences against reference sentences to produce an objective score. Results show that the models can generate sentences describing both normal and violent images, with the Soft-Attention model achieving the best performance on both.
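As an illustration of how these metrics work, the sketch below scores a toy generated caption against two toy references using NLTK's BLEU implementation; the captions shown are invented examples, and the thesis evaluation may have used different tooling (e.g., the COCO caption evaluation scripts).

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy reference captions and a toy generated caption (not from the dataset).
references = [
    "two men fight on a city street".split(),
    "two people are fighting in the street".split(),
]
candidate = "two men are fighting in the street".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)    # uniform weights over 1..n-gram precisions
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")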
Given our results, these models can generate descriptions of both violent and normal images. Such models could help analyze images found on the web, providing a brief description before a user opens violent content. The results obtained can serve as a base for further improving these models and for building models that analyze violent videos. This could lead to a system capable of analyzing images and videos in the background and generating a brief description of the events found in them, potentially improving security response times and crime prevention.