GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

1Carnegie Mellon University 2Neya Systems 3Hello Robot Inc. 4Bosch Center for AI

GraphEQA deployed on the Hello Robot Stretch RE2 platform in a home environment.
Question: "Is there a blue pan on the stove?"
Left panel: The robot navigating a home while planning and mapping the environment in real time, captured by an externally mounted camera.
Right panel: The metric-semantic 3D mesh and scene graph constructed by Hydra, alongside a TSDF-based 2D occupancy map in which white nodes mark explored areas, red nodes mark obstacles, and blue nodes mark clustered frontiers. The green node shows the target location chosen by the planner.
Right panel inset: Video feed from the robot head camera.

Abstract

A depiction of GraphEQA in simulation.

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.

Method

The overall GraphEQA method.

Overall GraphEQA architecture. As the agent explores the environment, it uses its sensor data (RGB-D images, semantic maps, camera poses, and intrinsics) to construct a 3D metric-semantic hierarchical scene graph (3DSG) and a 2D occupancy map for frontier selection in real time. The constructed 3DSG is then enriched with semantic room labels and semantically enriched frontier nodes. From the images collected during each trajectory execution, a task-relevant subset is selected and stored as the task-relevant visual memory. A VLM-based planner takes as input the enriched scene graph, the task-relevant visual memory, a history of past states and actions, and the embodied question, and outputs an answer, its confidence in that answer, and the next action to take in the environment. If the VLM agent is confident in its answer, the episode terminates; otherwise the proposed action is executed in the environment and the process repeats.
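To make this loop concrete, below is a minimal Python sketch of the perceive-plan-act cycle under stated assumptions: the objects and methods used (robot, scene_graph_builder, occupancy_mapper, visual_memory, planner) are hypothetical placeholders rather than the released code; only the structure of the loop follows the description above.

# Minimal sketch of the GraphEQA perceive-plan-act loop described above.
# All objects and methods used here (robot, scene_graph_builder, occupancy_mapper,
# visual_memory, planner) are hypothetical placeholders, not the released API.

from dataclasses import dataclass


@dataclass
class PlannerOutput:
    answer: str        # current best answer to the embodied question
    confidence: float  # planner's confidence in that answer, in [0, 1]
    action: dict       # proposed next action, e.g. {"type": "Goto_Frontier_node", "id": 3}
    reasoning: str     # brief justification returned by the VLM


def run_episode(question, robot, scene_graph_builder, occupancy_mapper,
                visual_memory, planner, confidence_threshold=0.9, max_steps=20):
    """Iterate perception, scene-graph enrichment, and VLM planning until the
    planner is confident in its answer or the step budget is exhausted."""
    history = []   # past states, actions, answers, and confidences
    answer = None
    for _ in range(max_steps):
        # 1. Update representations from the latest sensor data (RGB-D, semantics, poses).
        obs = robot.get_observations()
        scene_graph = scene_graph_builder.update(obs)   # 3D metric-semantic scene graph
        frontiers = occupancy_mapper.update(obs)        # clustered frontiers from the 2D map
        scene_graph = scene_graph_builder.enrich(scene_graph, frontiers)  # rooms + frontier nodes

        # 2. Keep only images relevant to the question as task-relevant visual memory.
        visual_memory.update(obs.rgb, question)

        # 3. Query the VLM planner with the multi-modal context.
        out: PlannerOutput = planner.plan(
            question=question,
            scene_graph=scene_graph,
            images=visual_memory.images(),
            state=robot.state(),
            history=history,
        )
        history.append(out)
        answer = out.answer

        # 4. Terminate if confident; otherwise execute the proposed action and repeat.
        if out.confidence >= confidence_threshold:
            break
        robot.execute(out.action)

    return answer, history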

Hierarchical VLM Planner Architecture

The hierarchical VLM planner architecture.

The hierarchical Vision-Language planner takes as input the question, the enriched scene graph, the task-relevant visual memory, the current state of the robot (position and room), and a history of past states, actions, answers, and confidence values. The planner chooses the next Goto_Object_node action hierarchically, first selecting a room node and then an object node within it. The Goto_Frontier_node action is chosen based on the object nodes connected to each frontier via edges in the scene graph. The planner is prompted to output brief reasoning for each chosen action, along with an answer, its confidence in that answer, the reasoning behind the answer and confidence, the next action, and brief descriptions of the scene graph and the visual memory.
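As an illustration of this hierarchical action selection, the sketch below first asks the VLM to pick a room node and then an object node within that room, and chooses frontiers via the object nodes connected to them in the scene graph. The vlm.choose helper, the scene-graph accessors, and all field names are assumptions made for this example, not the paper's actual prompts or interfaces.

# Illustrative sketch of the hierarchical action selection described above.
# vlm.choose, the scene-graph accessors, and all field names are assumed for
# illustration only; they are not the authors' actual prompts or interfaces.

def select_goto_object(vlm, question, scene_graph, images, history):
    """Choose a Goto_Object_node action hierarchically: room first, then object."""
    # Step 1: select the room node most relevant to the question.
    room_id = vlm.choose(
        prompt=f"Question: {question} Which room should the robot search next?",
        options=[room.id for room in scene_graph.rooms],
        images=images, history=history,
    )
    # Step 2: select an object node, restricted to objects inside the chosen room.
    object_id = vlm.choose(
        prompt=f"Question: {question} Which object in room {room_id} should the robot visit?",
        options=[obj.id for obj in scene_graph.objects_in(room_id)],
        images=images, history=history,
    )
    return {"type": "Goto_Object_node", "room": room_id, "object": object_id}


def select_goto_frontier(vlm, question, scene_graph, images, history):
    """Choose a Goto_Frontier_node action using the objects each frontier connects to."""
    # Describe each frontier by the semantic objects linked to it in the scene graph.
    options = [
        f"{f.id}: near " + ", ".join(obj.label for obj in scene_graph.objects_connected_to(f))
        for f in scene_graph.frontiers
    ]
    frontier_id = vlm.choose(
        prompt=f"Question: {question} Which frontier is most promising to explore?",
        options=options,
        images=images, history=history,
    )
    return {"type": "Goto_Frontier_node", "frontier": frontier_id}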

Experimental Results

The results.

We compare the performance of GraphEQA against competitive baselines in simulation, using the Habitat-Sim environment on the HM-EQA dataset, and in the real world in two indoor environments: a home and an office. The table above reports the comparison in simulation using three metrics: average success rate (%), average number of VLM planning steps, and average trajectory length Lτ. We compare against a strong baseline, Explore-EQA, which calibrates Prismatic-VLM to answer embodied questions confidently, and against Explore-EQA-GPT4o, a version of Explore-EQA that uses GPT-4o instead of Prismatic-VLM. Finally, we compare against SayPlanVision, a modified version of SayPlan that, in addition to the full scene graph, also has access to the task-relevant visual memory.

GraphEQA achieves a higher success rate than Explore-EQA and Explore-EQA-GPT4o without needing to build an explicit 2D task-specific semantic memory. Compared to Explore-EQA, our method completes tasks in significantly fewer planning steps and navigates the environment more efficiently (shorter trajectory length). Explore-EQA-GPT4o achieves lower success rates but also requires fewer planning steps; qualitative results show that it tends to be overconfident and terminates episodes early. GraphEQA also outperforms SayPlanVision without needing access to the complete scene graph. Qualitatively, we observe that, given access to the complete scene graph, SayPlanVision is overconfident about its choice of object node actions, leading to shorter trajectory lengths in successful cases but also to more failure cases.
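For reference, the three reported metrics are simple per-episode averages; the sketch below shows one way to aggregate them from episode logs. The record field names (success, planning_steps, trajectory_length) are assumptions for this example, not the paper's evaluation code.

# Sketch of aggregating the reported metrics over a set of evaluation episodes.
# The episode record fields (success, planning_steps, trajectory_length) are
# assumed names for illustration, not the paper's evaluation code.

def aggregate_metrics(episodes):
    """Return average success rate (%), VLM planning steps, and trajectory length L_tau."""
    n = len(episodes)
    return {
        "success_rate_pct": 100.0 * sum(e["success"] for e in episodes) / n,
        "avg_planning_steps": sum(e["planning_steps"] for e in episodes) / n,
        "avg_trajectory_length": sum(e["trajectory_length"] for e in episodes) / n,
    }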

Additional Real-world Experiments

BibTeX

@article{grapheqa2024,
  author    = {Saxena, Saumya and Buchanan, Blake and Paxton, Chris and Chen, Bingqing and Vaskevicius, Narunas and Palmieri, Luigi and Francis, Jonathan and Kroemer, Oliver},
  title     = {GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering},
  journal   = {arXiv},
  year      = {2024},
}