Geometric Context Transformer for Streaming 3D Reconstruction

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map is its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming and iterative optimization-based methods.

One-sentence Summary

LingBot-Map is a feed-forward 3D foundation model for streaming 3D reconstruction, built on a geometric context transformer whose attention mechanism integrates an anchor context, a pose-reference window, and a trajectory memory to ensure geometric accuracy and temporal consistency, enabling stable inference at around 20 FPS on 518 × 378 resolution inputs over sequences exceeding 10,000 frames and outperforming existing streaming and iterative optimization-based approaches across diverse benchmarks.

Key Contributions

  • LingBot-Map is introduced as a feed-forward 3D foundation model built upon a geometric context transformer (GCT) architecture for reconstructing scenes from streaming data.
  • A specialized attention mechanism integrates an anchor context, a pose-reference window, and a trajectory memory to retain geometric context and correct long-range drift without requiring test-time training.
  • Extensive evaluations across a variety of benchmarks demonstrate superior performance compared to existing approaches while enabling stable inference at approximately 20 FPS on 518 × 378 resolution inputs over sequences exceeding 10,000 frames.

Introduction

Streaming 3D reconstruction enables robots and autonomous systems to recover camera poses and point clouds from continuous video feeds, requiring high geometric accuracy and temporal consistency. While recent foundation models excel offline, existing streaming approaches struggle to balance rich context with computational efficiency over long sequences: recurrent methods often forget geometric priors, caching strategies lead to unbounded memory growth, and hybrid SLAM systems rely on slow iterative optimization. To resolve these issues, the authors introduce LingBot-Map, a feed-forward 3D foundation model built on a Geometric Context Transformer. Its specialized attention mechanism manages anchor context, local pose references, and trajectory memory to enable stable inference at 20 FPS for sequences exceeding 10,000 frames.

Dataset

Dataset Composition and Sources

  • The authors curate a training corpus of 29 datasets covering indoor, outdoor, object-centric, synthetic, and real-world scenarios.
  • Data is categorized into multi-view collections with unordered frames and video sequences with continuous camera trajectories.
  • Evaluation benchmarks consist of five datasets: Oxford Spires, ETH3D, 7-Scenes, Tanks and Temples, and NRGBD.

Training Data Usage and Sampling

  • Stage 1 builds general geometric priors using all 29 datasets with roughly balanced sampling ratios.
  • Stage 2 shifts focus to long-trajectory video data by increasing sampling weights for datasets like TartanAir, Waymo, and ScanNet++.
  • Multi-view-only datasets are down-weighted or dropped in the second stage to prioritize temporal structure.
  • The Foldback Video Sampler creates temporally coherent subsequences by reversing direction at sequence boundaries to avoid degenerate oscillation, as sketched after this list.
  • Each iteration samples 2 to 24 frames per scene with a dynamic batch sampler that packs at most 48 images per GPU.
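
The foldback behavior can be pictured with a short index-level sketch. This is an illustrative reimplementation, not the authors' code; the interface and defaults are assumptions.

```python
import random

def foldback_indices(num_frames: int, length: int, stride: int = 1) -> list[int]:
    """Sample `length` temporally coherent frame indices, reversing direction
    whenever a sequence boundary is reached ("folding back"), so short videos
    still yield smooth clips instead of oscillating on a single endpoint."""
    cur = random.randrange(num_frames)
    direction = random.choice([-1, 1])
    indices = []
    for _ in range(length):
        indices.append(cur)
        nxt = cur + direction * stride
        if nxt < 0 or nxt >= num_frames:   # hit a boundary: fold back
            direction = -direction
            nxt = cur + direction * stride
            if nxt < 0 or nxt >= num_frames:
                nxt = cur                  # single-frame edge case: stay put
        cur = nxt
    return indices

# e.g. a 10-frame clip from a 6-frame video: [4, 5, 4, 3, 2, 1, 0, 1, 2, 3]
print(foldback_indices(num_frames=6, length=10))
```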

Data Processing and Standardization

  • All public datasets are standardized into a unified format by converting coordinate systems to OpenCV standards and normalizing depth scales to meters.
  • Corrupted frames are filtered out based on file consistency, minimum frame thresholds, and invalid depth values like NaN or Inf.
  • Metadata is serialized into pickle files containing scene lists, frame mappings, and 4×4 pose matrices for efficient training composition; an illustrative schema is sketched after this list.
  • Synthetic data is rendered from Objaverse and Texverse assets using Blender Cycles with metric depth in OpenEXR format.
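
As a rough illustration of what such a serialized index could look like, the snippet below builds and writes a minimal metadata record; the field names and layout are assumptions, not the paper's actual schema.

```python
import pickle
import numpy as np

# Hypothetical metadata record for one dataset; all field names are illustrative.
metadata = {
    "scenes": ["scene_0001", "scene_0002"],
    "frames": {  # scene id -> ordered (rgb path, depth path) pairs
        "scene_0001": [("rgb/000000.jpg", "depth/000000.exr")],
        "scene_0002": [("rgb/000000.jpg", "depth/000000.exr")],
    },
    "poses": {   # scene id -> (N, 4, 4) camera-to-world matrices, OpenCV convention
        "scene_0001": np.eye(4, dtype=np.float32)[None],
        "scene_0002": np.eye(4, dtype=np.float32)[None],
    },
    "depth_unit": "meters",
}

with open("scene_index.pkl", "wb") as f:
    pickle.dump(metadata, f)
```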

Specialized Data Construction

  • Internal game engine data provides long trajectories through diverse indoor and outdoor environments while excluding cutscenes and UI overlays.
  • MatrixCity aerial and street data are reorganized into temporally continuous sequences via random walks on spatial topologies (see the sketch after this list).
  • Cross-scene traversal sequences are generated using Habitat-Sim on Gibson, Matterport3D, and HM3D scenes to simulate long-range navigation through multiple rooms.
  • This specialized generation produces approximately 2,800 sequences totaling 14.4 TB with smooth camera motion and realistic gaze shifts.
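
A random walk over a spatial topology of camera positions might look like the following sketch; the graph representation and the preference for unvisited neighbors are assumptions made for illustration.

```python
import random
import networkx as nx  # any adjacency structure would do; networkx keeps it short

def random_walk_sequence(graph: nx.Graph, length: int) -> list:
    """Walk a graph whose nodes are camera positions and whose edges connect
    spatially adjacent viewpoints, preferring unvisited neighbors so the
    sequence keeps traversing the scene rather than looping in place."""
    node = random.choice(list(graph.nodes))
    path, visited = [node], {node}
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:
            break
        unvisited = [n for n in neighbors if n not in visited]
        node = random.choice(unvisited or neighbors)
        path.append(node)
        visited.add(node)
    return path

# Example on a small grid of viewpoints:
print(random_walk_sequence(nx.grid_2d_graph(4, 4), length=10))
```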

Evaluation Benchmark Configuration

  • Oxford Spires uses 13 scenes with sparse and dense frame settings to test single-pass and streaming capabilities.
  • 7-Scenes frames are downsampled by a stride of 5 to reduce redundancy while retaining viewpoint coverage.
  • ETH3D and Tanks and Temples utilize all available frames with specific depth thresholds for reconstruction metrics.

Method

The authors propose LingBot-Map, a streaming foundation model designed for long-range 3D reconstruction from continuous visual input. Given a stream of images $\mathcal{I} = \{I_1, I_2, \ldots\}$, the system processes each new frame $I_t$ to estimate its camera pose $\hat{P}_t$ and depth map $\hat{D}_t$ using only current and past observations. This approach enables the reconstruction of large-scale 3D environments in real time without requiring access to future frames.
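
The streaming contract can be made concrete with a stub loop; `StreamingReconstructor` and its `step` method are placeholders for exposition, not a released API.

```python
import torch

class StreamingReconstructor(torch.nn.Module):
    """Stub showing the per-frame interface only; the real model's internals
    (ViT backbone, GCA layers, prediction heads) are described below."""
    def step(self, frame: torch.Tensor, state: dict):
        state["frames_seen"] = state.get("frames_seen", 0) + 1
        pose = torch.eye(4)                     # placeholder for \hat{P}_t
        depth = torch.zeros(frame.shape[-2:])   # placeholder for \hat{D}_t
        return pose, depth, state               # state stays compact over time

model, state = StreamingReconstructor(), {}
stream = (torch.rand(3, 378, 518) for _ in range(5))  # stand-in for I_1, I_2, ...
for frame in stream:
    pose_t, depth_t, state = model.step(frame, state)  # past frames only
```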

The overall framework, illustrated in the pipeline diagram below, relies on a Vision Transformer (ViT) backbone initialized from DINOv2. Each input image is encoded into $M$ image tokens, which are augmented with a camera token, four register tokens, and a learnable anchor token. These tokens are processed through multiple alternating layers of Frame Attention and Geometric Context Attention (GCA). Frame Attention refines features within each frame, while GCA facilitates cross-frame geometric reasoning. Finally, task-specific heads predict the absolute camera pose and the corresponding depth map.
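
The token layout and the alternating block structure can be sketched as follows; the dimensions, head count, and the reduction of the cached context to a single tensor are illustrative stand-ins, not the paper's configuration.

```python
import torch
import torch.nn as nn

M, D = 37 * 27, 1024   # e.g. 518 x 378 at patch size 14 -> 37 x 27 patch tokens

class GCTBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, 16, batch_first=True)
        self.context_attn = nn.MultiheadAttention(dim, 16, batch_first=True)

    def forward(self, tokens, context):
        # Frame Attention: tokens of the current frame attend to each other.
        tokens = tokens + self.frame_attn(tokens, tokens, tokens)[0]
        # Geometric Context Attention: current tokens attend to cached context
        # (anchor frames, pose-reference window, trajectory memory).
        tokens = tokens + self.context_attn(tokens, context, context)[0]
        return tokens

patch_tokens = torch.randn(1, M, D)          # from the DINOv2-initialized ViT
camera = torch.randn(1, 1, D)                # camera token
registers = torch.randn(1, 4, D)             # four register tokens
anchor = torch.randn(1, 1, D)                # learnable anchor token
frame_tokens = torch.cat([camera, registers, anchor, patch_tokens], dim=1)
context = torch.randn(1, 2048, D)            # stand-in for cached context tokens
out = GCTBlock(D)(frame_tokens, context)
```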

A key innovation of this architecture is the Geometric Context Attention mechanism, which addresses the challenge of managing geometric context in a streaming setting. The model must retain sufficient long-range context for global consistency while keeping the state compact for efficient inference. Drawing inspiration from classical SLAM systems, GCA decomposes the streaming context into three complementary components: Anchor Context, Local Pose-Reference Window, and Trajectory Memory.

The Anchor Context establishes a consistent coordinate system and absolute scale by designating the first $n$ images as anchor frames. The Local Pose-Reference Window maintains a sliding window of the $k$ most recent frames to provide dense visual overlap for accurate frame registration. The Trajectory Memory retains a compact summary of the full observation history to correct accumulated drift. This structured approach allows the model to balance long-term consistency with bounded per-frame cost.
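
A minimal sketch of how this three-part state could be maintained per frame; the default sizes and the one-summary-token-per-frame assumption are illustrative, not taken from the paper.

```python
from collections import deque

class GeometricContext:
    """Sketch of the three-part streaming context for GCA."""
    def __init__(self, n_anchor: int = 4, k_window: int = 8):
        self.n_anchor = n_anchor
        self.anchors = []                     # first n frames: fix coords & scale
        self.window = deque(maxlen=k_window)  # k most recent frames: dense overlap
        self.memory = []                      # per-frame summaries: drift correction

    def update(self, frame_tokens, summary_token):
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(frame_tokens)  # anchor context is fixed once full
        self.window.append(frame_tokens)       # oldest window frame drops out
        self.memory.append(summary_token)      # lightweight, temporally ordered

    def context(self):
        # Tokens visible to Geometric Context Attention for the next frame.
        return self.anchors + list(self.window) + self.memory
```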

The efficiency of GCA is achieved through a specialized attention mask design. As shown in the comparison of attention patterns below, standard full attention cannot operate in a streaming fashion, while causal attention causes memory to grow linearly with sequence length. Sliding window attention bounds computation but sacrifices long-term context. In contrast, GCA combines the anchor context, trajectory memory, and local window into a structured mask that retains rich long-range context while keeping memory and computation nearly constant as the sequence length increases.
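
The frame-level sketch below contrasts the three mask patterns; the anchor count and window size are arbitrary, and the real model applies the mask at token granularity with trajectory-memory summaries added on top.

```python
import torch

def build_masks(T: int, n_anchor: int = 2, k_window: int = 3):
    """Frame-level attention masks (entry [i, j] is True if frame i may attend
    to frame j). Illustrative only; actual masking is per token."""
    i = torch.arange(T)[:, None]
    j = torch.arange(T)[None, :]
    causal = j <= i                        # KV cache grows linearly with T
    sliding = causal & (j > i - k_window)  # bounded cost, but no long-range context
    # GCA (frame level): recent window plus always-visible anchor frames; the
    # trajectory memory additionally exposes a compact summary of every past frame.
    gca = sliding | (causal & (j < n_anchor))
    return causal, sliding, gca

causal, sliding, gca = build_masks(T=8)
print(gca.int())  # anchors stay visible; the recent window is dense
```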

To ensure robust trajectory estimation, the model incorporates a trajectory memory that summarizes past observations. This memory maintains a lightweight yet temporally ordered record of all past frames, providing long-range cues to correct drift. The connectivity and path consistency within this trajectory are managed to ensure smooth transitions across the sequence, as visualized in the path connectivity diagram below.

Training is performed using a composite loss function that includes depth, absolute pose, and relative pose terms. The depth and absolute pose losses follow standard definitions, while a relative pose loss is applied over frame pairs within the sliding window to encourage local trajectory consistency. To handle the computational cost of long sequences, the authors employ a progressive view training strategy, starting with short subsequences and gradually increasing the number of views. Additionally, context parallelism is used to distribute views across multiple GPUs, enabling efficient training on long sequences.
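
A hedged sketch of how such a composite objective might be assembled; the specific norms, weights, and pairing scheme below are stand-ins, since the summary does not give the exact formulations.

```python
import torch
import torch.nn.functional as F

def composite_loss(depth_pred, depth_gt, pose_pred, pose_gt,
                   w_depth=1.0, w_abs=1.0, w_rel=0.5, window=8):
    """depth_*: (T, H, W) depth maps; pose_*: (T, 4, 4) camera-to-world poses."""
    valid = torch.isfinite(depth_gt) & (depth_gt > 0)
    l_depth = F.l1_loss(depth_pred[valid], depth_gt[valid])  # depth term
    l_abs = F.l1_loss(pose_pred, pose_gt)                    # absolute pose term
    # Relative pose over frame pairs inside the sliding window: predicted
    # relative transforms should match ground truth for local consistency.
    l_rel = depth_pred.new_tensor(0.0)
    pairs = 0
    T = pose_pred.shape[0]
    for i in range(T):
        for j in range(i + 1, min(i + window, T)):
            rel_pred = torch.linalg.inv(pose_pred[i]) @ pose_pred[j]
            rel_gt = torch.linalg.inv(pose_gt[i]) @ pose_gt[j]
            l_rel = l_rel + F.l1_loss(rel_pred, rel_gt)
            pairs += 1
    if pairs:
        l_rel = l_rel / pairs
    return w_depth * l_depth + w_abs * l_abs + w_rel * l_rel
```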

For inference, the system utilizes a paged key-value (KV) cache layout to manage memory efficiently. This approach eliminates the overhead of frequent cache updates associated with standard contiguous layouts. By leveraging optimized attention kernels and paged memory management, the implementation achieves real-time performance, processing video sequences at approximately 20 FPS while maintaining stable reconstruction over thousands of frames.
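
The idea behind a paged layout can be shown in a few lines; this sketch only illustrates the indexing scheme (page sizes and shapes are arbitrary) and omits the attention kernels that consume the page table.

```python
import torch

class PagedKVCache:
    """Keys/values live in fixed-size pages addressed through a page table, so
    evicting old frames or appending new ones updates indices rather than
    copying a large contiguous buffer."""
    def __init__(self, num_pages=1024, page_size=64, heads=16, head_dim=64):
        self.k = torch.zeros(num_pages, page_size, heads, head_dim)
        self.v = torch.zeros_like(self.k)
        self.free = list(range(num_pages))
        self.page_table = []          # logical order of active pages

    def append(self, k_new, v_new):   # k_new, v_new: (page_size, heads, head_dim)
        page = self.free.pop()
        self.k[page], self.v[page] = k_new, v_new
        self.page_table.append(page)

    def evict_oldest(self):
        self.free.append(self.page_table.pop(0))  # no data movement, just indices

cache = PagedKVCache()
cache.append(torch.randn(64, 16, 64), torch.randn(64, 16, 64))
cache.evict_oldest()
```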

Experiment

The evaluation compares LingBot-Map against offline, optimization-based, and streaming baselines across diverse benchmarks to validate camera pose estimation and 3D reconstruction capabilities. Results indicate that the method achieves superior global consistency and reconstruction fidelity in long sequences where competing approaches suffer from accumulated drift and geometric fragmentation. Qualitative analysis confirms accurate trajectory tracking through complex scene transitions without requiring explicit optimization or loop closure. Furthermore, ablation studies verify that architectural components such as temporal encoding and trajectory memory are essential for maintaining long-range stability and computational efficiency.

The authors compare a bounded pose-reference window against full causal attention to analyze efficiency and accuracy trade-offs. The bounded window significantly improves inference speed and reduces memory consumption while simultaneously lowering trajectory error and translation error. Although full attention yields slightly better rotation accuracy, it suffers from higher computational costs, making the bounded window the superior choice overall for streaming applications.

The authors evaluate camera pose estimation on the Oxford Spires dataset, comparing their online streaming approach against offline, optimization-based, and other online baselines. The proposed method achieves the highest pose accuracy and lowest trajectory error across nearly all metrics, significantly outperforming competing streaming methods, which suffer from accumulated drift, while also surpassing more computationally intensive offline and optimization-based techniques. It maintains strong global consistency and local frame-to-frame accuracy without requiring access to future frames.

The authors conduct an ablation study to evaluate the individual contributions of Anchor Initialization, Context Tokens, Relative Pose Loss, and Video RoPE to the model's pose estimation capabilities. Each component provides incremental improvements, with the full configuration yielding the best performance across all trajectory and pose accuracy metrics. Anchor Initialization resolves scale ambiguity and improves both local and global pose accuracy; Context Tokens preserve geometric cues from the full history, reducing accumulated drift; and Video RoPE injects temporal ordering that lets the model reason about sequential structure, producing the most substantial reduction in trajectory error.

The authors evaluate LingBot-Map against state-of-the-art streaming methods on a large-scale trajectory estimation benchmark under both sparse and dense frame settings. LingBot-Map achieves the lowest trajectory error in both settings with minimal drift, whereas competing methods degrade substantially as sequence length increases, particularly on dense sequences. The method also strikes a strong balance between accuracy and efficiency, delivering the highest precision at inference speeds comparable to other real-time streaming methods.

The authors evaluate 3D reconstruction quality across three datasets (ETH3D, 7-Scenes, NRGBD) comparing LingBot-Map against various online streaming baselines. Results indicate that the proposed method consistently achieves superior performance in terms of accuracy, completeness, and F1 scores compared to competing approaches. On the ETH3D dataset, the proposed method achieves a substantially higher F1 score compared to the runner-up, driven by improvements in both accuracy and completeness. For 7-Scenes, the method attains the lowest error rates for accuracy and completeness while securing the top F1 ranking among all online methods. On NRGBD, the approach demonstrates a clear advantage with the best F1 score and lowest completeness error, outperforming the next best baseline by a notable margin.

The authors validate their approach by comparing a bounded pose reference window against full attention, revealing that the bounded configuration optimizes inference speed and memory usage while improving trajectory accuracy. Evaluations on the Oxford Spires dataset and large-scale benchmarks demonstrate that the proposed method outperforms offline optimization and competing streaming baselines by minimizing accumulated drift without sacrificing precision. An ablation study confirms that components such as Video RoPE and Anchor Initialization are critical for resolving scale ambiguity and enhancing temporal consistency. Consequently, the method achieves superior 3D reconstruction quality and trajectory performance across diverse datasets while maintaining real-time efficiency.

