1. Orchestrator. The Orchestrator parses a free-form instruction
into structured Spatial Description Clauses (SDCs): an anchor object, a spatial
predicate (e.g., "right of"), and a metric constraint (e.g., "2 meters"). For the
query "Where is 2 meters to the right of the fridge?", it extracts
anchor: fridge, predicate: right-of, metric: 2.0 m. These clauses
are passed to the Grounding and Spatial agents for referent resolution and
distribution generation.
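The clause extraction step can be sketched as follows. This is an illustrative toy, not the system's implementation: a real Orchestrator would use an LLM or learned semantic parser, whereas here a hand-written regex stands in for the grammar, and the `SDC` dataclass fields simply mirror the anchor/predicate/metric triple described above.

```python
# Toy SDC parser: regex-based stand-in for the Orchestrator's extraction.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SDC:
    anchor: str                 # referent object, e.g. "fridge"
    predicate: str              # spatial relation, e.g. "right-of"
    metric: Optional[float]     # distance constraint in meters, if any

# Hypothetical predicate lexicon covering a few directional relations.
_PATTERN = re.compile(
    r"(?:(?P<metric>\d+(?:\.\d+)?)\s*(?:m|meters?)\s+)?"
    r"to the (?P<direction>left|right|front|back) of the (?P<anchor>\w+)"
)

def parse_sdc(query: str) -> Optional[SDC]:
    m = _PATTERN.search(query.lower())
    if m is None:
        return None
    metric = float(m.group("metric")) if m.group("metric") else None
    return SDC(anchor=m.group("anchor"),
               predicate=f"{m.group('direction')}-of",
               metric=metric)

sdc = parse_sdc("Where is 2 meters to the right of the fridge?")
# yields anchor "fridge", predicate "right-of", metric 2.0
```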
2. Grounding Agent. The Grounding Agent resolves each anchor referent to a concrete object instance in the 3D scene graph. It uses a combination of string similarity between the referent text and node labels, CLIP-based visual similarity between the referent and object image crops, and a spatial salience prior based on proximity and visibility. A belief distribution over referents is maintained on the shared memory ledger and updated as new observations arrive.
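A minimal sketch of the referent-scoring step: the three cues named above (string similarity, CLIP visual similarity, spatial salience) are fused into a softmax belief over candidates. The CLIP and salience scores are taken as precomputed floats here, and the fusion weights are assumptions, not values from this work.

```python
# Sketch: fuse string similarity, (precomputed) CLIP similarity, and
# salience into a normalized belief distribution over candidate objects.
import math
from difflib import SequenceMatcher

def string_sim(referent: str, label: str) -> float:
    return SequenceMatcher(None, referent.lower(), label.lower()).ratio()

def ground_referent(referent, candidates, w=(0.4, 0.4, 0.2)):
    """candidates: list of (label, clip_sim, salience) tuples.
    Returns softmax belief over candidate indices; weights w are assumed."""
    scores = [
        w[0] * string_sim(referent, label) + w[1] * clip + w[2] * sal
        for label, clip, sal in candidates
    ]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

belief = ground_referent("fridge", [
    ("refrigerator", 0.9, 0.7),   # illustrative similarity values
    ("freezer", 0.5, 0.3),
    ("oven", 0.1, 0.2),
])
```

Because the belief is a normalized distribution, it can be written to the shared memory ledger and re-weighted as new observations arrive, matching the update pattern described above.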
3. Spatial Agent. Once a referent is resolved, the Spatial Agent generates a continuous probability density function over 3D space representing where the goal is likely to be. Directional predicates (e.g., "right of") are modeled as von Mises or radial Gaussian kernels anchored to the resolved object's pose. Metric constraints are modeled as radial Gaussians centered at the predicted distance offset. These analytic kernels are composed by multiplying and renormalizing in log-space, producing a single composed goal distribution.
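The kernel composition can be sketched on a discrete 2-D ground-plane grid. This is a simplified illustration under assumed parameters: the von Mises concentration `kappa`, Gaussian width `sigma`, and grid resolution are placeholder values, and the continuous density is approximated by a normalized grid.

```python
# Sketch: compose a von Mises directional kernel with a radial Gaussian
# metric kernel by adding log-densities, then renormalizing.
import numpy as np

def compose_goal_distribution(anchor_xy, anchor_heading, offset_m,
                              kappa=4.0, sigma=0.3, grid_res=0.05, extent=4.0):
    xs = np.arange(anchor_xy[0] - extent, anchor_xy[0] + extent, grid_res)
    ys = np.arange(anchor_xy[1] - extent, anchor_xy[1] + extent, grid_res)
    X, Y = np.meshgrid(xs, ys)
    dx, dy = X - anchor_xy[0], Y - anchor_xy[1]
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)

    # "right of" as an unnormalized von Mises centered 90 deg clockwise
    # of the anchor's heading (assumed convention).
    mu = anchor_heading - np.pi / 2
    log_dir = kappa * np.cos(theta - mu)
    # Metric constraint as a radial Gaussian at the stated offset.
    log_rad = -0.5 * ((r - offset_m) / sigma) ** 2

    log_p = log_dir + log_rad          # product of kernels in log-space
    log_p -= log_p.max()               # numerical stabilization
    p = np.exp(log_p)
    p /= p.sum()                       # renormalize to a distribution
    return X, Y, p

# Anchor at the origin facing +y; "2 m right of" puts the mode near (2, 0).
X, Y, p = compose_goal_distribution((0.0, 0.0), np.pi / 2, 2.0)
```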
4. Verifier and Planner. A Verifier checks the composed distribution for coherence and can trigger corrective retries if needed. The final composed distribution is then passed to a sampling-based planner (RRT*), with its top-k modes extracted as executable navigation targets, so the output is usable by the motion planner without any additional interpretation.
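The verification and waypoint-extraction steps above can be sketched as follows, assuming the composed distribution arrives as a normalized 2-D grid. The entropy-based coherence check and its threshold are illustrative assumptions, not the paper's criteria, and a real system would hand the resulting waypoints to RRT* rather than stop here.

```python
# Sketch: reject degenerate distributions, then pull the top-k grid
# modes as candidate waypoints for the downstream planner.
import numpy as np

def verify(p, max_entropy_frac=0.9):
    """Reject grids that are unnormalized or near-uniform (too diffuse
    to yield a meaningful goal). Threshold is a placeholder."""
    if not np.isclose(p.sum(), 1.0):
        return False
    nz = p[p > 0]
    ent = -np.sum(nz * np.log(nz))
    return ent < max_entropy_frac * np.log(p.size)

def top_k_waypoints(X, Y, p, k=3):
    """Return the (x, y) coordinates of the k highest-probability cells."""
    flat = np.argsort(p, axis=None)[::-1][:k]
    idx = np.unravel_index(flat, p.shape)
    return list(zip(X[idx], Y[idx]))
```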