Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Swagat Padhan★ · Lakshya Jain★ · Bhavya Minesh Shah · Omkar Patil · Thao Nguyen · Nakul Gopalan

★ Equal contribution

Arizona State University · Haverford College

MAPG pipeline overview
MAPG decomposes a natural language instruction into semantic, spatial, and metric components, grounds each with a specialized agent, and composes them probabilistically into a continuous goal distribution for the planner.

Abstract

Robots collaborating with humans need to convert natural language goals into actionable, physically grounded decisions. A command like "go two meters to the right of the fridge" requires grounding a semantic reference, a spatial relation, and a metric constraint all at once, inside a 3D scene. While recent vision-language models (VLMs) are strong at semantic grounding, they are not designed to handle metric constraints in physically defined spaces. We empirically show that state-of-the-art VLM-based grounding approaches struggle on complex metric-semantic queries.


To address this, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each one. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. We also introduce MAPG-Bench, a new benchmark specifically designed to evaluate metric-semantic goal grounding, covering 30 indoor scenes and 100 annotated queries. Finally, we present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

Contributions

  • A multi-agent probabilistic 3D spatial reasoning framework that couples online 3D scene graphs with analytically defined continuous spatial kernels to produce planner-ready goal distributions for metric-semantic instructions.
  • MAPG-Bench, a first-of-its-kind HM3D goal grounding benchmark for metric-semantic queries, with an open-source dataset and evaluation protocol spanning 30 unique indoor scenes and 100 annotated queries designed for object-to-world goal grounding in realistic indoor layouts.
  • Empirical findings and ablations showing our method achieves low distance error (0.07 m) and low angular error (0.3 degrees yaw, 3.8 degrees pitch) in goal grounding, along with a failure taxonomy that categorizes observed failure modes for reproducible comparison with future goal-grounding systems.

Method

MAPG is built around a modular pipeline of specialized agents that share a common memory ledger and operate over a 3D scene graph of the environment. Given a natural language instruction, the system decomposes it, resolves referents, generates spatial distributions, and composes them into a single planner-ready goal density.

1. Orchestrator. The Orchestrator parses a free-form instruction into structured Spatial Description Clauses (SDCs): an anchor object, a spatial predicate (e.g., "right of"), and a metric constraint (e.g., "2 meters"). For the query "Where is 2 meters to the right of the fridge?", it extracts anchor: fridge, predicate: right-of, metric: 2.0 m. These clauses are passed to the Grounding and Spatial agents for referent resolution and distribution generation.
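The decomposition step can be illustrated with a minimal sketch. In MAPG the Orchestrator is a VLM; the regex parser below is only a hypothetical stand-in to make the SDC structure concrete, and the field names (`anchor`, `predicate`, `metric_m`) are illustrative, not the paper's schema.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SDC:
    """Spatial Description Clause: anchor object, spatial predicate, metric constraint."""
    anchor: str
    predicate: str
    metric_m: Optional[float]

def parse_sdc(query: str) -> SDC:
    # Toy pattern for queries of the form "<d> meters to the <dir> of the <obj>".
    # The real Orchestrator uses a VLM, not a regex.
    m = re.search(
        r"(\d+(?:\.\d+)?)\s*meters?\s+to\s+the\s+(\w+)\s+of\s+the\s+(\w+)", query
    )
    if m is None:
        raise ValueError(f"could not parse: {query!r}")
    return SDC(anchor=m.group(3),
               predicate=f"{m.group(2)}-of",
               metric_m=float(m.group(1)))

sdc = parse_sdc("Where is 2 meters to the right of the fridge?")
# sdc -> SDC(anchor='fridge', predicate='right-of', metric_m=2.0)
```

The structured clause is then handed to the downstream agents, which is what lets each sub-problem be solved by a specialist rather than one monolithic model.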

2. Grounding Agent. The Grounding Agent resolves each anchor referent to a concrete object instance in the 3D scene graph. It uses a combination of string similarity between the referent text and node labels, CLIP-based visual similarity between the referent and object image crops, and a spatial salience prior based on proximity and visibility. A belief distribution over referents is maintained on the shared memory ledger and updated as new observations arrive.
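A minimal sketch of the belief update over referents, under stated assumptions: the visual-similarity and salience scores are taken as given (in MAPG they come from CLIP and the scene graph), the linear weights `w` are illustrative, and `SequenceMatcher` stands in for whatever string-similarity measure the system actually uses.

```python
from difflib import SequenceMatcher

def referent_belief(referent, candidates, visual_sim, salience, w=(0.4, 0.4, 0.2)):
    """Combine string, visual, and salience scores into a normalized belief
    over candidate object labels. All score dictionaries map label -> [0, 1]."""
    scores = {}
    for label in candidates:
        s_text = SequenceMatcher(None, referent, label).ratio()
        scores[label] = w[0] * s_text + w[1] * visual_sim[label] + w[2] * salience[label]
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

belief = referent_belief(
    "fridge", ["refrigerator", "sofa"],
    visual_sim={"refrigerator": 0.9, "sofa": 0.1},   # stand-in CLIP scores
    salience={"refrigerator": 0.5, "sofa": 0.5},
)
# belief["refrigerator"] > belief["sofa"]
```

Because the output is a normalized distribution stored on the shared ledger, it can be re-weighted as new observations arrive rather than committing to a hard choice.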

3. Spatial Agent. Once a referent is resolved, the Spatial Agent generates a continuous probability density over 3D space representing where the goal is likely to be. Directional predicates (e.g., "right of") are modeled as von Mises or radial Gaussian kernels anchored to the resolved object's pose, and metric constraints as radial Gaussians centered at the specified distance offset. These analytic kernels are composed by summing their log-densities and renormalizing, producing a single composed goal distribution.
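The kernel composition can be sketched in 2D as follows. This is a simplified illustration, not the paper's implementation: the concentration `kappa` and width `sigma` are made-up values, the convention that "right of" corresponds to a relative bearing of -π/2 (x-forward, y-left) is an assumption, and the density is evaluated on a discrete grid rather than in closed form.

```python
import numpy as np

def composed_goal_density(anchor_xy, anchor_yaw, rel_bearing, dist_m,
                          grid, kappa=4.0, sigma=0.3):
    """Compose a von Mises bearing kernel with a radial Gaussian distance kernel
    by summing log-densities over candidate goal positions, then renormalizing.
    grid: (N, 2) array of candidate goal positions."""
    d = grid - anchor_xy
    r = np.linalg.norm(d, axis=1)
    theta = np.arctan2(d[:, 1], d[:, 0])
    # von Mises log-density over bearing relative to the anchor's heading
    log_dir = kappa * np.cos(theta - (anchor_yaw + rel_bearing))
    # radial Gaussian log-density over distance from the anchor
    log_rad = -0.5 * ((r - dist_m) / sigma) ** 2
    logp = log_dir + log_rad
    p = np.exp(logp - logp.max())   # subtract max for numerical stability
    return p / p.sum()

# "2 meters to the right of the fridge", fridge at origin facing +x:
xs = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(xs, xs)
grid = np.stack([X.ravel(), Y.ravel()], axis=1)
p = composed_goal_density(np.zeros(2), 0.0, -np.pi / 2, 2.0, grid)
# density peaks near (0, -2): two meters to the fridge's right
```

Working in log-space keeps the product of several sharp kernels numerically stable, and the result is already a distribution a sampling-based planner can consume.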

4. Verifier and Planner. A Verifier checks the composed distribution for coherence and can trigger corrective retries if needed. The final composed distribution is passed directly to a sampling-based planner (RRT*), which extracts the top-k waypoints as executable navigation targets. This makes the output directly usable by a motion planner without any additional interpretation.
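Waypoint extraction can be sketched as a greedy top-k selection with a minimum-separation constraint, so that the planner receives k distinct peaks rather than k adjacent cells of the same mode. The function name and the `min_sep` heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def top_k_waypoints(grid, probs, k=3, min_sep=0.5):
    """Greedily pick the k highest-probability positions that are at least
    min_sep meters apart, so each waypoint covers a distinct mode."""
    order = np.argsort(probs)[::-1]   # indices from most to least probable
    picked = []
    for i in order:
        p = grid[i]
        if all(np.linalg.norm(p - q) >= min_sep for q in picked):
            picked.append(p)
            if len(picked) == k:
                break
    return np.array(picked)
```

Given the composed goal density, the selected waypoints can be fed directly to RRT* as candidate navigation targets, with no extra interpretation layer between grounding and planning.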

MAPG motivating example
Motivating example: a metric-semantic instruction that requires simultaneous object grounding, spatial relation resolution, and metric constraint satisfaction.
MAPG system overview
MAPG system overview. The orchestrator decomposes the instruction; specialized agents ground each component; results are composed probabilistically into a target distribution.

Results

We evaluate MAPG on two benchmarks: MAPG-Bench, our new metric-semantic goal grounding benchmark, and HM-EQA, a standard embodied question answering benchmark. On MAPG-Bench, MAPG is compared against open-source specialist models (SRGPT, GraphEQA) and proprietary generalist VLMs across several variants (Gemini 3 Pro, OpenAI GPT 5.2, Gemini 2.5 Pro, Claude Opus 4.6).

Our best MAPG variant (Claude Opus 4.6) reduces the object-to-world localization error from 5.82 m down to 0.07 m (a 98.8% reduction), yaw error from 13.5 degrees to 0.3 degrees (85.9% reduction), and pitch error from 27.9 degrees to 3.8 degrees (84.2% reduction), while achieving a Task Success Rate of 0.98 and an Anchor Pick Success Rate of 0.98. These gains are primarily structural: removing the explicit spatial reasoner drops object-selection success rate from 0.42 to 0.20, confirming that explicit decomposition and composition are the primary drivers, not just better prompting.

On HM-EQA question answering, MAPG (Claude Opus 4.6) reaches a task success rate of 0.71, outperforming all compared baselines, including GraphEQA-Gemini-2.5Pro at 0.67. We also ran a real-world demonstration with a Robotis AI Worker robot, constructing a scene graph from a physical indoor environment and running MAPG inference on spatial queries. MAPG successfully grounded the queried targets, showing that the approach transfers beyond simulation.

Decomposed Grounding Visualizations

The figures below show how each component of MAPG contributes to the final composed goal distribution for the query "Where is 2 meters to the right of the fridge?" Semantic grounding identifies the fridge, the metric kernel models the 2 m offset, the directional kernel captures "right of," and their composition yields a planner-ready goal distribution.

Semantic Grounding
Spatial Reasoning
Metric Constraint
Composed Distribution

BibTeX

@misc{padhan2026meaningsmeasurementsmultiagentprobabilistic,
  title         = {Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation},
  author        = {Swagat Padhan and Lakshya Jain and Bhavya Minesh Shah and Omkar Patil and Thao Nguyen and Nakul Gopalan},
  year          = {2026},
  eprint        = {2603.19166},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.19166}
}