SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Abstract
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
One-sentence Summary
This work introduces SpatialClaw, a training-free framework that adopts code as a flexible action interface, surpassing the inflexibility of single-pass execution and structured tool calls to enable vision-language models to freely compose perception operations for open-ended 3D and 4D spatial reasoning.
Key Contributions
- The paper introduces SpatialClaw, a training-free framework that replaces rigid tool menus with executable Python code as a dynamic action interface for open-ended 3D and 4D spatial reasoning.
- The paper provides controlled comparisons against alternative action interfaces, alongside trace-level analyses that isolate the spatial reasoning patterns responsible for the performance improvements.
- Evaluated across 20 spatial reasoning benchmarks with a consistent scoring protocol, the approach shows that a code action interface improves vision-language models without fine-tuning, added parameters, or additional training data.
Introduction
Vision-language models have advanced significantly in general perception, yet spatial reasoning remains a persistent bottleneck for applications in robotics, embodied AI, and assistive systems. Prior approaches typically rely on costly fine-tuning or rigid tool-augmented architectures that lock models into fixed interfaces, commit to complete programs before evaluating intermediate outputs, or lack the ability to inspect perception results step-by-step. To address these constraints, the authors introduce SpatialClaw, a training-free framework that equips existing VLMs with a code action interface. By generating executable Python scripts on demand, the system dynamically composes specialized perception tools, enables turn-by-turn verification of intermediate results, and substantially improves spatial reasoning without adding parameters or requiring additional training data.
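The step-by-step loop described above (one executable cell per step, with state carried between steps and every prior output visible to the agent) can be illustrated with a minimal sketch. It is not the authors' implementation: `StatefulKernel`, `run_agent`, `scripted_policy`, and the `None` stop signal are hypothetical names, the "kernel" is just a shared `exec` namespace, and a scripted function stands in for the VLM.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: variables survive across cells."""

    def __init__(self, preloaded=None):
        # One shared namespace plays the role of kernel state.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed (or the traceback)."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors are returned as text so the next step can react to them.
                print(traceback.format_exc())
        return buffer.getvalue()


def run_agent(kernel, propose_cell, max_steps=8):
    """Ask `propose_cell` for one cell per step, conditioned on all prior outputs."""
    history = []  # list of (cell_source, cell_output) pairs
    for _ in range(max_steps):
        cell = propose_cell(history)
        if cell is None:  # hypothetical stop signal
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


def scripted_policy(history):
    """Stand-in for the VLM: replays fixed cells; a real agent would read `history`."""
    cells = [
        "n = len(frames)\nprint(n)",
        "mid = frames[n // 2]\nprint(mid)",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


# Strings stand in for the pre-loaded input frames.
kernel = StatefulKernel(preloaded={"frames": ["f0", "f1", "f2"]})
history = run_agent(kernel, scripted_policy)
```

The second cell reuses `n` from the first, which is the property that separates this interface from single-pass execution: the analysis can be revised after each intermediate result instead of being fixed up front.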
Dataset
- Dataset composition and sources: The authors draw from a video sequence dataset where each sample provides a list of PIL images alongside a textual object description.
- Key subset details: The primary collection focuses on object existence verification. Annotations are organized within a PerFrameMask structure that records frame indices, object labels, total frame counts, and object quantities, enabling direct indexing by absolute frame position.
- Data usage and processing: The authors use this data to evaluate object presence, count instances, and generate descriptions rather than to train a model, since the framework is training-free. The system accepts the image list and object name as input, then returns a dictionary containing existence flags, instance counts per image, and a short textual summary.
- Metadata construction and processing details: Metadata is managed through the PerFrameMask object, which provides functions to extract 2D boolean segmentation masks, compute median 3D world centroids from reconstructed point clouds, and retrieve masked 3D point arrays. The pipeline also includes a visualization utility that overlays results for frame-by-frame verification.
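To make the per-frame structure in the list above concrete, here is a rough sketch of a container with the same responsibilities: masks indexed by absolute frame position, masked 3D points, a median centroid, and a summary dictionary. Everything in it is an assumption for illustration. `PerFrameMaskSketch`, its fields, and its method names are hypothetical and are not the real PerFrameMask API, and plain nested lists stand in for the arrays a real pipeline would likely use.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class PerFrameMaskSketch:
    """Hypothetical per-frame mask container keyed by absolute frame index."""

    label: str
    num_frames: int
    # frame index -> 2D boolean mask (rows of bools); frames without the object are absent
    masks: dict = field(default_factory=dict)
    # frame index -> number of object instances found in that frame
    counts: dict = field(default_factory=dict)

    def mask_at(self, frame_idx):
        """Return the 2D boolean mask for a frame, or None if the object is absent."""
        return self.masks.get(frame_idx)

    def points_at(self, frame_idx, point_map):
        """Collect the 3D points whose pixels fall inside the frame's mask."""
        mask = self.mask_at(frame_idx)
        if mask is None:
            return []
        return [
            point
            for mask_row, point_row in zip(mask, point_map)
            for inside, point in zip(mask_row, point_row)
            if inside
        ]

    def centroid_at(self, frame_idx, point_map):
        """Per-axis median of the masked 3D points, or None if there are none."""
        points = self.points_at(frame_idx, point_map)
        if not points:
            return None
        return tuple(median(axis) for axis in zip(*points))

    def summary(self):
        """Existence flag, per-frame instance counts, and a short description."""
        per_frame = [self.counts.get(i, 0) for i in range(self.num_frames)]
        seen = sum(1 for c in per_frame if c > 0)
        return {
            "exists": seen > 0,
            "counts": per_frame,
            "description": f"{self.label}: visible in {seen} of {self.num_frames} frames",
        }


# A 2x2 toy frame: the object covers the top row of frame 1 only.
track = PerFrameMaskSketch(
    label="chair",
    num_frames=3,
    masks={1: [[True, True], [False, False]]},
    counts={1: 1},
)
# Per-pixel 3D world coordinates for frame 1 (made-up values).
point_map = [
    [(0.0, 0.0, 2.0), (1.0, 0.0, 4.0)],
    [(0.0, 1.0, 9.0), (1.0, 1.0, 9.0)],
]
centroid = track.centroid_at(1, point_map)
report = track.summary()
```

A median is used for the centroid because, unlike a mean, it is not pulled far off by a few stray points in a noisy reconstruction.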
Experiment
Evaluated across twenty spatial reasoning benchmarks spanning single-image, multi-view, and video domains, SpatialClaw is tested against multiple baselines using six diverse open-source vision-language models. The experiments validate that a persistent code-based action interface consistently outperforms structured tool calls and single-pass execution, particularly on tasks demanding iterative geometric computation across frames and viewpoints. Qualitative analyses reveal that the agent spontaneously adapts its tool usage to question semantics and that performance gains stem primarily from flexible code composition rather than predefined utility functions or model scale. Ultimately, the study concludes that designing expressive, revision-capable action interfaces is a highly impactful and generalizable strategy for advancing training-free spatial reasoning agents.