Conceptual Exploration of Contextual Information for Situational Understanding
Authors: Stratis Aloimonos, Adrienne Raglin
Abstract: The Army is often required to deploy soldiers into dangerous situations to offer assistance and relief. When deployed, these soldiers need to be aware of potential dangers, properly assess the level of possible threats, and make the best choices in response. One solution for this problem space is an intelligent system that recognizes scenes which may contain danger, regardless of the type or timeframe associated with that danger. Such a system would support decisions about what to do in situations where danger may be present. Thus, an intelligent system that could identify the scene and its contextual information, for example potential dangers, would provide greater situational understanding and support autonomous systems and soldier interactions. As a proxy for scenes similar to those encountered by soldiers, a set of images of natural or manmade disasters was selected and used to identify strengths and weaknesses in existing models for this type of intelligent system. In this work, images from CRISISMMD, a dataset of natural disaster tweets, are used together with other public-domain disaster images that do not belong to any particular dataset. In the initial phase of the work, this image collection was used to determine and showcase the strengths and weaknesses of existing object recognition and visual question answering systems that, when combined, would create a prototype intelligent system. Specifically, YOLO (You Only Look Once) (Bochkovskiy et al. 2020), augmented with Word2Vec (a natural language processing (NLP) system that finds similarities between words in a very large corpus), was selected for object recognition. This combination was selected to further identify objects based on the presence of other, similar objects, using the similarities between their names. Also explored were CLIP (Contrastive Language Image Pretraining), which assigns probabilities to scenes from a given set of candidate possibilities, and BLIP (Bootstrapping Language Image Pretraining) (Li et al. 2022), an advanced visual question answering system that is also capable of generating captions for images. In addition, the concept of an intelligent system in which contextual information is identified and utilized can be used to support situational understanding.
Keywords: Object Recognition, Visual Question Answering, Scene Identification