When Robots Should Say "I Don’t Know": Benchmarking Abstention in Embodied Question Answering

Tao Wu1,2, Chuhao Zhou1, Guangyu Zhao3, Haozhi Cao1, Yewen Pu2, Jianfei Yang1†
† Corresponding author.
1 MARS Lab, Nanyang Technological University, 2 NPL, Nanyang Technological University, 3 CraftJarvis, Peking University

Figure 1: Two settings of AbstainEQA: Episodic-Memory EQA (Left): A user asks whether the bathroom floor is wet, but the agent has not visited the bathroom yet and incorrectly responds “dry,” creating a risk that the user slips and falls. Active EQA (Right): The user did not specify which pillow to check, causing the agent to wander aimlessly around the house.

Abstract

Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they lack sufficient information to answer. In this work, we focus on abstention, a minimal requirement for EQA agents: knowing when to withhold an answer.

From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants aligned with these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation.
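To make the pairing concrete, the sketch below shows one plausible layout for a paired instance. The field names (scene_id, category, evidence, and so on) and the example content, which echoes the pillow scenario in Figure 1, are illustrative assumptions rather than the released schema.

```python
# Illustrative sketch of one paired AbstainEQA entry; field names and values
# are assumptions for exposition, not the released data format.
from dataclasses import dataclass


@dataclass
class PairedInstance:
    scene_id: str            # identifier of the episode/scene (assumed field)
    original_question: str   # well-posed OpenEQA question
    ambiguous_question: str  # annotator-rewritten variant requiring abstention
    category: str            # one of the five abstention categories
    evidence: str            # annotator note on why the query is unanswerable


example = PairedInstance(
    scene_id="episode-0001",
    original_question="Is the pillow on the living-room sofa white?",
    ambiguous_question="Is the pillow white?",
    category="referential_underspecification",
    evidence="Several pillows appear in the scene; the target is never specified.",
)
```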

Evaluating on AbstainEQA, we find that even the best frontier model attains only 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning yield only marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
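As a minimal sketch (not the official evaluation code), abstention recall can be read as the fraction of abstention-required queries on which the model actually withholds an answer; on the balanced paired set one would also track how often the model refuses well-posed questions. The record fields `requires_abstention` and `model_abstained` below are assumed names.

```python
# Minimal sketch of abstention metrics on a balanced paired set.
# Assumes each record has a ground-truth flag `requires_abstention` and a
# boolean model decision `model_abstained` (both names are illustrative).

def abstention_metrics(records):
    """Return abstention recall and the over-abstention rate on answerable cases."""
    need = [r for r in records if r["requires_abstention"]]
    answerable = [r for r in records if not r["requires_abstention"]]

    # Abstention recall: share of abstention-required queries the model withholds.
    recall = sum(r["model_abstained"] for r in need) / len(need)
    # Over-abstention: share of well-posed queries the model refuses to answer.
    over_abstain = sum(r["model_abstained"] for r in answerable) / len(answerable)
    return recall, over_abstain


if __name__ == "__main__":
    toy = [
        {"requires_abstention": True, "model_abstained": True},
        {"requires_abstention": True, "model_abstained": False},
        {"requires_abstention": False, "model_abstained": False},
        {"requires_abstention": False, "model_abstained": True},
    ]
    recall, over = abstention_metrics(toy)
    print(f"abstention recall = {recall:.2f}, over-abstention = {over:.2f}")
```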

Dataset Collection Pipeline

Figure 2: The overall construction process of AbstainEQA, showing how naturally posed human queries, the abstention taxonomy, paired datasets, evidence annotations, and data augmentation together form a unified benchmark for evaluating uncertainty-aware embodied agents. The entire pipeline required 460 hours of human annotation.

Examples in AbstainEQA

Figure 3: Example human-annotated ambiguous queries and dataset statistics from AbstainEQA, illustrating the types of uncertainty encountered in real human–agent interactions and providing a stronger foundation for studying and improving embodied interaction.

Agent's Fabricated Answers

Figure 4: Examples of ambiguous queries that should trigger abstention but for which the model instead produces unfounded answers based on incomplete visual evidence.

Impacts of Ambiguous Queries on Navigation

Figure 5: Changes in path length for GPT-4o under ambiguous queries. Two cases of ineffective exploration are shown: shortened and lengthened trajectories.

SFT Fails to Improve Abstention Capability

Figure 6: Examples showing that SFT-trained models rely on memorized textual patterns rather than visual evidence: the same query receives the same response across different scenes, even when one case is answerable and the other requires abstention.