HyperAIHyperAI

Command Palette

Search for a command to run...

From Pixels to Words -- Towards Native One-Vision Models at Scale

Abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

One-sentence Summary

NEO-ov is a native one-vision foundation model that learns pixel-word and cross-frame correspondences end-to-end without external encoders, adapters, or post-hoc fusion, eliminating modular boundaries to enable unified spatiotemporal modeling that achieves competitive performance at scale across multi-image, video, and fine-grained visual perception tasks.

Key Contributions

  • The paper introduces NEO-ov, a native vision-language foundation model that eliminates pre-trained encoders and adapters to unify spatial and temporal modeling within a single monolithic backbone. By learning cross-frame and pixel-word correspondence end-to-end from raw inputs, the architecture preserves fine-grained visual signals for unified spatiotemporal reasoning.
  • Benchmark evaluations demonstrate that the encoder-free model surpasses existing native VLMs and approaches encoder-based competitors across diverse multimodal tasks. The unified representation space captures low-level geometric perception, motion dynamics, and long-range visual dependencies, delivering robust spatial intelligence without fragmented feature alignment.
  • The work presents systematic architectural analyses and detailed training recipes that document the design choices and optimization strategies for native multimodal modeling. These contributions validate the feasibility of native architectures at scale and facilitate subsequent research in unified vision-language systems.

Introduction

Current vision-language models are increasingly deployed for complex multimodal applications like video understanding, multi-image analysis, and spatial reasoning, yet they rely on a modular encoder-decoder architecture that connects pretrained vision encoders to large language models. This fragmented design forces early compression of visual signals, discards fine-grained spatial and texture details, and creates efficiency and scalability bottlenecks that hinder true cross-modal integration. To overcome these constraints, the authors introduce NEO-ov, a native one-vision foundation model that removes external encoders and adapters entirely. By training a single monolithic backbone end-to-end on raw inputs, the model learns pixel-word correspondence and spatiotemporal dynamics natively, delivering competitive performance across diverse benchmarks while providing a clear architectural blueprint for future unified multimodal systems.

Dataset

  • Dataset composition and sources: The authors draw all resources from open-access datasets that feature explicitly defined usage policies.
  • Key details for each subset: The provided excerpt does not specify subset sizes, individual sources, or filtering rules.
  • Data usage and training configuration: The text does not outline training splits, mixture ratios, or how the data is integrated into the model.
  • Processing and metadata: No cropping strategies, metadata construction, or additional preprocessing steps are described in the provided section.
  • Additional procedural notes: The authors clarify that large language models served only as writing assistants for grammar and style refinement. All methodological, experimental, and conclusion content was developed and verified entirely by the human authors.

Method

The authors leverage a unified native vision-language backbone to extend autoregressive modeling across single-image, multi-image, and video inputs, forming a monolithic architecture that supports cross-image reasoning, temporal understanding, and spatial localization. The framework processes image and video inputs, along with text, into a unified sequence of tokens that are jointly processed by a single decoder-only model. Image inputs are encoded into visual tokens using a lightweight patch embedding layer composed of two convolutional layers with a GELU activation, producing one visual token for each 32×3232 \times 3232×32 region. Text inputs are tokenized using the original language model tokenizer. Visual tokens are wrapped with <img> and </img> delimiters and concatenated with text tokens, forming a single sequence that is processed by the shared backbone. This approach enables efficient pixel-word and pixel-pixel alignment, as well as spatial-temporal reasoning within a single native framework.

Refer to the framework diagram to understand how image and video inputs, represented at their original resolutions, are processed alongside text through a patch embedding layer and word embedding layer, respectively, to form a unified token sequence that enters the native vision-language backbone.

The model employs a THW-decoupled attention mechanism, where attention heads are explicitly designed with separate dimensions for temporal (TTT), height (HHH), and width (WWW) components. This design preserves the temporal modeling capability of the base language model while augmenting it with dedicated spatial modeling. For tokens iii and jjj, the Query and Key features are decomposed into TTT, HHH, and WWW components, and their correlation is computed as the sum of inner products across each dimension. The TTT branch captures textual order, cross-image relations, and cross-frame dependencies, while the HHH and WWW branches model 2D spatial structure. This is complemented by native rotary positional embeddings (Native-RoPE), which assign distinct indices for temporal and spatial positions. Text tokens retain only the temporal index, with spatial indices set to zero, whereas image tokens share a common temporal index within each image and use hih_ihi and wiw_iwi to encode their spatial coordinates. Temporal indices remain continuous across modalities, while spatial indices are independently defined within each image.

As shown in the figure below, the native rotary position embedding system unifies bidirectional spatial interactions within images with causal dependencies across text and video frames through a THW-aware frequency channel and index allocation, enabling unified modeling across single-image, multi-image, and video understanding.

For multi-image inputs, each <img> token in the prompt is replaced by an independent visual segment, preserving the textual order and representing each image as a distinct unit in the sequence. This allows images to be encoded at arbitrary resolution, adapting the number of visual tokens to the image's spatial size, which is beneficial for fine-grained comparison and spatially sensitive tasks. For video inputs, the model represents the video as a temporally ordered sequence of sampled frames, each serialized as an image unit with an associated timestamp. A global prefix encoding the video duration, number of sampled frames, and sampling rate is prepended, and explicit timestamps are included to facilitate temporal localization and cross-frame reasoning.

The training procedure for NEO-ov consists of three progressive stages. In the pre-training stage, the model develops foundational visual perception while aligning visual representations with the language backbone's semantic space. Optimization is restricted to the patch embedding layers, pre-buffer layers, and newly introduced QK-related parameters, using an autoregressive next-token objective. The mid-training stage scales spatial-temporal reasoning and enhances perception over high-resolution visual content, with all model layers jointly optimized on a diverse dataset. The context length is progressively extended, and a unified mixture of data types is used to improve stability and generalization. In the supervised fine-tuning stage, the model is refined on high-quality instruction-tuning data, with end-to-end optimization to strengthen fine-grained perception, long-context reasoning, and temporal dynamics modeling.

Experiment

The evaluation assesses NEO-ov across image understanding, video comprehension, and spatial intelligence tasks, benchmarking it against both native and modular vision-language architectures while conducting ablation studies on attention mechanisms and training progression. The results demonstrate that native end-to-end modeling successfully preserves fine-grained visual context and long-range dependencies, enabling robust reasoning and effective hallucination suppression without external encoders. Additionally, deep pixel-level interactions and progressive training stages consistently strengthen spatial perception and cross-modal generalization, collectively validating the scalability and competitive advantage of unified native multimodal frameworks.

The authors evaluate NEO-ov on multi-image and video understanding benchmarks, comparing it against both modular and native vision-language models. Results show that NEO-ov achieves competitive or superior performance across various tasks, particularly in video understanding and multi-image reasoning, demonstrating the effectiveness of its native architecture. The model consistently outperforms prior native models and matches or exceeds modular counterparts in key areas such as temporal reasoning and long-context understanding. NEO-ov achieves competitive or superior performance compared to both modular and native vision-language models on multi-image and video understanding benchmarks. NEO-ov shows strong gains in video understanding tasks, particularly in long-context and temporal reasoning, outperforming several modular models. NEO-ov demonstrates consistent improvements across different scales and training stages, indicating effective progressive training for multimodal capabilities.

The authors evaluate NEO-ov across various benchmarks, comparing its performance to both specialized and general-purpose models. Results show that NEO-ov achieves competitive or superior performance, particularly in spatial intelligence tasks, and demonstrates strong scalability across different model sizes. The model consistently outperforms or matches leading alternatives on multiple benchmarks, highlighting its effectiveness in capturing fine-grained visual and spatial representations. NEO-ov achieves competitive or superior performance compared to spatial-specialist models on multiple spatial intelligence benchmarks. NEO-ov outperforms general-purpose models on several tasks, especially in spatial reasoning and geometric understanding. The model shows consistent performance improvements across different scales and training stages, indicating strong scalability and generalization.

The authors evaluate NEO-ov across multiple domains including image understanding, video understanding, and spatial intelligence, demonstrating strong performance compared to both native and modular VLMs. The results show that progressive training stages improve performance across all benchmarks, with more significant gains observed at smaller model scales. The model achieves competitive or superior results on reasoning-intensive and hallucination-sensitive tasks, highlighting the effectiveness of native end-to-end modeling. NEO-ov achieves competitive or superior performance compared to both native and modular VLMs across diverse benchmarks. Progressive training stages consistently improve performance, with more pronounced gains at smaller model scales. NEO-ov shows strong performance on reasoning-intensive and hallucination-sensitive tasks, demonstrating the effectiveness of native modeling.

The authors evaluate NEO-ov on multiple benchmarks across image understanding, OCR recognition, video understanding, and spatial intelligence. Results show that NEO-ov achieves strong performance across various tasks, particularly excelling in reasoning-intensive and hallucination-sensitive scenarios. It demonstrates competitive or superior performance compared to both native and modular vision-language models, especially in tasks requiring fine-grained visual and spatial understanding. NEO-ov achieves strong performance on reasoning-intensive and hallucination-sensitive benchmarks, surpassing prior native and modular models. NEO-ov outperforms other models on OCR recognition and spatial intelligence tasks, highlighting its ability to capture fine-grained visual and spatial representations. The model shows consistent improvements across different scales and training stages, indicating effective learning and generalization capabilities.

The authors compare different architectural approaches for multimodal models, focusing on the performance of a native architecture with a pre-buffer mechanism against traditional encoder-based methods. Results show that the pre-buffer approach achieves competitive or superior performance across various tasks, particularly in OCR and spatial intelligence, suggesting that direct pixel-level interactions enhance visual understanding. The study also highlights that progressive training stages improve performance, especially for smaller model sizes, indicating effective learning of multimodal capabilities. The pre-buffer mechanism outperforms encoder-based methods on OCR and spatial intelligence tasks, indicating better handling of fine-grained visual details and spatial dependencies. Native architectures with direct pixel-pixel and pixel-word interactions show stronger performance on spatial intelligence benchmarks compared to encoder-based models. Progressive training stages lead to consistent performance improvements, with more significant gains observed in smaller model variants.

The authors evaluate NEO-ov across benchmarks covering multi-image and video understanding, spatial intelligence, and OCR recognition, comparing it against modular, native, and encoder-based vision-language models. These experiments validate the effectiveness of a native architecture utilizing a pre-buffer mechanism for direct pixel-level interactions, alongside a progressive training strategy. Qualitatively, the model consistently delivers competitive or superior results, particularly excelling in temporal reasoning, spatial understanding, and hallucination-sensitive scenarios. Overall, the findings demonstrate that the proposed approach achieves strong scalability, captures fine-grained visual representations effectively, and generalizes reliably across different model scales.


Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp