Jaskaran Singh Sodhi

ROMEO: Rational Optimized Multi-task Enhancement framework for VQA

This blog post summarizes what is covered extensively in the corresponding report.

Abstract: This work explores (i) a self-refinement method that grounds VQA responses in the query image (to reduce hallucinations), and (ii) a multimodal routing framework that learns failure patterns of SoTA models, in the realm of common-sense reasoning for self-driving.

Table of Contents

  1. Introduction
  2. Problem Setup and Dataset
  3. Baselines
  4. ROMEO
  5. Multimodal Router
  6. Qualitative Results and Theories

Introduction

In the rapidly evolving domain of autonomous vehicles (AVs), understanding and reasoning about visual scenes is paramount. My recent work on the ROMEO framework, developed as part of a collaborative project, tackles some of the most complex challenges in Visual Question Answering (VQA) for on-road anomalies. This project was also inspired by my discussions with Shubham during my internship at Kodiak.

[Figure: vqa_intro]

Challenges in Knowledge-Intensive VQA

Autonomous vehicles must process a diverse range of inputs, from identifying road signs in dim lighting to discerning subtle environmental cues like animal-crossing warnings or unusual traffic patterns. Current VQA systems face limitations on such knowledge-intensive queries: they hallucinate details that are not grounded in the image, and their failure patterns are hard to predict.

These challenges are particularly significant for self-driving systems, where errors can compromise safety.

Introducing ROMEO: A Rational-Optimized Framework

ROMEO (Rational Optimized Multi-task Enhancement framework) introduces innovations tailored for knowledge-intensive tasks like those encountered in autonomous driving:

  1. Self-Refinement: ROMEO iteratively improves its understanding by aligning visual and textual data, enabling nuanced reasoning about complex scenes, such as determining whether a street is one-way based on visual context. This is done by introducing a novel self-refinement loss that grounds the pooled image representation in the pooled text representation of the generated tokens (see the sketch after this list).

  2. Multimodal Routing: The framework dynamically selects the best vision-language model (VLM) to optimize performance and cost, which is crucial for real-time applications in AVs. We explore such a best-model-selection framework to learn failure patterns across models.

  3. Rationale Generation: ROMEO not only answers questions but also provides detailed explanations, enhancing interpretability. For instance, it can explain why it identified a school-zone sign and how that affects recommended speed adjustments. This capability is unlocked by our choice of the A-OKVQA dataset, which provides ground-truth (user-annotated) rationales for each VQA sample.
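
Below is a minimal PyTorch sketch of the self-refinement loss from item 1, assuming one plausible form: mean-pool the projected patch embeddings and the embeddings of the generated tokens, then penalize the cosine distance between the two pooled vectors. The exact formulation is in Section 5.2 of the report.

```python
import torch
import torch.nn.functional as F

def self_refinement_loss(patch_embs: torch.Tensor,  # (B, P, D) projected patch embeddings
                         token_embs: torch.Tensor,  # (B, T, D) embeddings of generated tokens
                         token_mask: torch.Tensor   # (B, T) 1 for real tokens, 0 for padding
                         ) -> torch.Tensor:
    img_pooled = patch_embs.mean(dim=1)  # (B, D)
    txt_pooled = (token_embs * token_mask.unsqueeze(-1)).sum(dim=1) \
        / token_mask.sum(dim=1, keepdim=True).clamp(min=1)  # masked mean, (B, D)
    # One plausible grounding objective: 1 - cosine similarity of the pooled vectors.
    return (1.0 - F.cosine_similarity(img_pooled, txt_pooled, dim=-1)).mean()
```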

Applications in Autonomous Driving

The ROMEO framework was motivated by edge cases I encountered in self-driving, as well as by the advances in VLM technology I saw during my summer at Kodiak.

Problem Setup and Dataset

Metrics

We follow the A-OKVQA authors and use the same metrics, multiple-choice (MC) accuracy and direct-answer (DA) accuracy, for direct comparison with recent work.
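
A minimal scoring sketch is below, assuming the standard VQA-style soft accuracy for the DA setting (a prediction earns min(matches/3, 1) credit against the annotated direct answers); the official A-OKVQA evaluation code is authoritative.

```python
def mc_accuracy(pred_choice: int, correct_choice: int) -> float:
    # Multiple-choice: plain accuracy over the selected option.
    return float(pred_choice == correct_choice)

def da_accuracy(pred_answer: str, direct_answers: list[str]) -> float:
    # Direct answer: VQA-style soft accuracy against the annotated answers.
    num_match = sum(pred_answer == ans for ans in direct_answers)
    return min(1.0, num_match / 3.0)
```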

Compute Requirements

The table below shows dataset statistics for the visual data from the COCO 2017 dataset used by A-OKVQA.

| Split | Images | Size      |
| ----- | ------ | --------- |
| Train | 17,056 | 878.09 MB |
| Val   | 1,122  | 57.44 MB  |
| Test  | 6,702  | 321.60 MB |

Visual Modality Analysis

The A-OKVQA dataset uses images from the COCO 2017 dataset. We performed visual analysis by running object detection on the validation split of A-OKVQA: the 1,145 validation questions make use of 1,122 COCO 2017 images. We ran the YOLOv8 Small object detector on these images; statistics relating the number of detected objects to the images are shown below.

[Figures: object_detections_plot; example detections on COCO image 000000548339]
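
As a rough sketch, the per-image object counts behind this analysis can be gathered with the ultralytics package; the directory path and model weights below are assumptions for illustration, not our exact setup.

```python
from pathlib import Path
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # YOLOv8 Small detector

counts = {}
for img_path in sorted(Path("coco_val_images").glob("*.jpg")):  # hypothetical image dir
    result = model(str(img_path), verbose=False)[0]
    counts[img_path.name] = len(result.boxes)  # number of detected objects

print(f"mean objects per image: {sum(counts.values()) / max(len(counts), 1):.2f}")
```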

Baselines

| Model         | Method Type       | MCQ   | DA    |
| ------------- | ----------------- | ----- | ----- |
| T5            | Unimodal          | 12.54 | 0.00  |
| GPT-2         | Unimodal          | 20.44 | 2.02  |
| CLIP          | Simple Multimodal | 22.63 | 0.00  |
| ViLBERT       | Simple Multimodal | 22.18 | 8.32  |
| ClipCap       | Simple Multimodal | 56.93 | 30.89 |
| Llama 3.2 11B | SoTA Multimodal   | 83.05 | 59.59 |
| LLaVA-1.5     | SoTA Multimodal   | 32.13 | 29.86 |
| Molmo 7B      | SoTA Multimodal   | 58.14 | 64.46 |
| GPT-4o mini   | SoTA Multimodal   | 83.58 | 41.33 |
| PaliGemma 3B  | SoTA Multimodal   | 77.72 | 47.16 |

We also plot heatmaps showing, for each model pair, the number of questions that one model (rows) answered correctly but the other (columns) answered incorrectly, in both the DA and MC settings.

[Figures 4 and 5: da_final_cm and mc_final_cm — pairwise correct-vs-incorrect heatmaps for the DA and MC settings]

From the heatmaps (Figures 4 and 5), we grouped the SoTA multimodal models into two categories based on their complementary failure patterns.

We also perform a study to analyze how the number of objects present in a scene affects the VQA performance of different strong models.

[Figure: Perf_vs_num_objects — VQA performance vs. number of objects in the scene]

We found that, across models, the percentage of questions answered correctly is largely unaffected by the number of objects present in the scene.

Proposed Model: ROMEO

We present the architecture of ROMEO in Figure 1: a rationale-optimized multi-task learning framework tailored for enhanced Visual Question Answering (VQA). The core innovation lies in generating intermediate rationales for questions about images, which subsequently guide the model in providing accurate answers. This rationale generation enables the framework to better understand visual context and improve response accuracy. For details on the training objective and loss-function math, see Section 5.2 of the report.

[Figure 1: ROMEO architecture]

The Vision Transformer

The pipeline begins with a vision transformer that encodes the input image into patch embeddings (p1, p2, …), extracting essential visual features. These embeddings are then projected into the high-dimensional text embedding space using a Visual Mapper, ensuring compatibility with the LLM’s representation. Simultaneously, the Text Embedder processes the instruction prompt tokens (t1, t2, …). These textual tokens are combined with visual features to form structured input prompts that include the instruction template and VQA question.
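
As a minimal sketch, the Visual Mapper can be read as a learned projection from the ViT patch-embedding width into the LLM's hidden size; the dimensions and single-linear-layer design below are assumptions for illustration (the report specifies the actual mapper).

```python
import torch.nn as nn

class VisualMapper(nn.Module):
    """Projects ViT patch embeddings into the LLM's text embedding space."""
    def __init__(self, vit_dim: int = 768, llm_dim: int = 4096):  # hypothetical widths
        super().__init__()
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_embs):  # (B, P, vit_dim) -> (B, P, llm_dim)
        return self.proj(patch_embs)
```

The projected patch embeddings are then concatenated with the instruction-prompt token embeddings to form the LLM input.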

The LLM

The LLM processes the input autoregressively, first generating a detailed rationale as intermediate output (r1, r2, …), and then producing the final answer embeddings (a1, a2, …). These two stages, rationale generation and answer generation, are integrated into a unified pipeline. To minimize errors and hallucinations, a self-refining loss mechanism aligns visual and textual representations while ensuring coherence between the generated explanations and answers.
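
For concreteness, here is an illustrative layout of a single training sequence; the marker strings, template wording, and example are hypothetical (the real prompt template is given in the report).

```python
IMAGE_SLOT = "<img_1> ... <img_P>"  # projected patch embeddings occupy these positions
question = "Is this street one-way?"
prompt = (f"{IMAGE_SLOT}\n"
          f"Instruction: give a rationale, then answer the question.\n"
          f"Question: {question}\n")
# The training target concatenates rationale tokens (r1, r2, ...) with answer
# tokens (a1, a2, ...), so inference decodes the rationale before the answer.
target = "Rationale: the signage and lane markings indicate a single direction. Answer: yes"
print(prompt + target)
```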

Training Objective

The training objective is a composite loss that optimizes three interdependent tasks: rationale generation (L_rationale), VQA answer generation (L_vqa), and self-refinement (L_refine). By leveraging multi-task learning, ROMEO ensures that shared representations across tasks yield more robust, contextually grounded, and generalizable outputs. Rationale generation acts as a crucial bridge between visual understanding and accurate answering, fostering deeper reasoning capabilities.
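
One plausible form of the composite objective, with hypothetical weights $\lambda_i$ and the cosine-distance refinement term sketched earlier (Section 5.2 of the report gives the exact formulation):

$$
\mathcal{L} \;=\; \lambda_1\,\mathcal{L}_{\text{rationale}} \;+\; \lambda_2\,\mathcal{L}_{\text{vqa}} \;+\; \lambda_3\,\mathcal{L}_{\text{refine}},
\qquad
\mathcal{L}_{\text{refine}} \;=\; 1 - \cos\big(\bar{v},\, \bar{t}\big),
$$

where $\bar{v}$ is the pooled image representation and $\bar{t}$ is the pooled text representation of the generated tokens.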

Performance of ROMEO on Rationale Generation

| Model                    | BLEU-4 | ROUGE |
| ------------------------ | ------ | ----- |
| GPT-4o                   | 16.47  | 44.69 |
| Llama 3.2 11B            | 13.39  | 39.55 |
| PaliGemma 3B             | 38.02  | 74.56 |
| ROMEO (Rationale Answer) | 52.46  | 58.38 |

Performance of ROMEO on Visual Question Answering

| Model                    | Method Type       | MCQ   | DA    |
| ------------------------ | ----------------- | ----- | ----- |
| T5                       | Unimodal          | 12.54 | 0.00  |
| GPT-2                    | Unimodal          | 20.44 | 2.02  |
| CLIP                     | Simple Multimodal | 22.63 | 0.00  |
| ViLBERT                  | Simple Multimodal | 22.18 | 8.32  |
| ClipCap                  | Simple Multimodal | 56.93 | 30.89 |
| Llama 3.2 11B            | SoTA Multimodal   | 83.05 | 59.59 |
| LLaVA-1.5                | SoTA Multimodal   | 32.13 | 29.86 |
| Molmo 7B                 | SoTA Multimodal   | 58.14 | 64.46 |
| GPT-4o mini              | SoTA Multimodal   | 83.58 | 41.33 |
| PaliGemma 3B             | SoTA Multimodal   | 77.72 | 47.16 |
| ROMEO (Direct Answer)    | STL Multimodal    | 76.92 | 48.64 |
| ROMEO (Rationale Answer) | MTL Multimodal    | 83.25 | 63.78 |

where STL is Single Task Learning and MTL is Multi-Task Learning (since we learn the rationale as well).

Multimodal Router

We pose model routing as a binary classification problem. Given an open-source VLM O and a proprietary VLM V, the router predicts the better VLM for a given VQA query. While open-source VLMs have lower inference cost than proprietary VLMs, they also have weaker performance. To navigate this cost-accuracy tradeoff, we want the router to capture patterns in the queries that let it make the best possible choice of VLM.
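
A minimal sketch of such a router is below, assuming a precomputed joint image-question embedding as input and a small MLP head; the feature source and architecture here are illustrative, not our exact setup.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Binary classifier: route a VQA query to the open-source or proprietary VLM."""
    def __init__(self, emb_dim: int = 512):  # hypothetical embedding width
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        # (B, emb_dim) joint image+question embedding -> (B, 1) logit;
        # logit > 0 routes to the proprietary VLM V, otherwise to the open-source VLM O.
        return self.head(query_emb)
```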

[Figure: multimodal router architecture]

Results

One limitation is that our router is designed to operate within a single decision-making paradigm, such as cost or performance, rather than integrating and balancing multiple paradigms. In future work, it would be interesting to extend our routing method to incorporate multiple decision-making paradigms. We would also explore how varying the decision threshold affects the cost-accuracy trade-off.

Our router was trained to route prompts to either GPT-4o mini or Llama 3.2 11B so as to minimize cost while retaining performance. The router's objective was to route to the cheaper model (Llama) whenever it was correct, and to the expensive model (GPT-4o mini) only when the cheaper model was incorrect but the expensive model was correct. When we created router labels for the A-OKVQA train set under this rule (sketched below), we ended up with 14,190 questions routed to Llama and only 1,679 routed to GPT-4o mini. This class imbalance left too few usable training samples, yielding a router with nearly random performance. In future work, we plan to mitigate this by exploring data augmentation techniques and by using other VQA datasets to obtain enough samples to train a better router.
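
The labeling rule reads directly as code; the function and label names are illustrative.

```python
def router_label(llama_correct: bool, gpt4o_correct: bool) -> str | None:
    """Assign a routing label for one A-OKVQA training question."""
    if llama_correct:
        return "llama-3.2-11b"   # cheaper model suffices
    if gpt4o_correct:
        return "gpt-4o-mini"     # only the expensive model gets it right
    return None                  # neither is correct: not usable for routing
```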

Qualitative Results and Theories

We observe several key patterns amongst the chosen models:

  1. Failure in Dark Images and Complex Reasoning: Across multiple examples, models struggle with dark images and tasks requiring complex reasoning. For instance, predicting the behavior of a cat or identifying distant objects like shampoo bottles remains challenging. ROMEO shows improvement by generating rationales that enhance understanding, though it still faces difficulties in dim lighting conditions.

  2. Weak Models Outperforming Strong Models in Specific Cases: In simpler scenes, weaker models such as Molmo 7B and LLaVA-1.5 7B occasionally outperform stronger models like GPT-4o and LLaMA. This is often because less complex reasoning or domain-specific knowledge is required, allowing weaker models to perform adequately.

  3. Cases Requiring OCR, Detailed Reasoning, and Spatial Analysis: Strong models generally excel in tasks requiring Optical Character Recognition (OCR), detailed reasoning about object interactions, or spatial analysis. ROMEO's integration of rationale generation aids in handling these complex scenarios effectively.

  4. Spatial Reasoning and Global Knowledge: Open-source models like LLaMA and PaliGemma outperform others in tasks involving spatial reasoning or requiring global knowledge, such as identifying sports positions. ROMEO benefits from its multi-task learning approach, which supports robust generalization across diverse tasks.

Theories for Failure Explanation