Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in on regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks and also improve general multimodal cognition on tasks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary and when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
One-sentence Summary
Researchers from Shanghai Jiao Tong University, Ant Group, and collaborators propose Region-to-Image Distillation, enabling MLLMs to internalize fine-grained perception during training—replacing costly iterative zooming with single-pass inference—validated on ZoomBench and boosting performance across multimodal tasks without runtime tools.
Key Contributions
- We introduce Region-to-Image Distillation, a training method that uses micro-cropped regions to generate high-quality, region-grounded VQA supervision for full-image models, enabling single-pass fine-grained perception without inference-time tool use or re-encoding.
- We present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, paired with a dual-view protocol to quantify the global-regional “zooming gap” and evaluate model acuity rigorously.
- Experiments show our distilled models (ZwZ-4B/7B/8B) achieve state-of-the-art performance on fine-grained perception tasks, outperforming larger MLLMs and agentic “Thinking-with-Images” methods while improving general multimodal cognition across visual reasoning and GUI agent benchmarks.
Introduction
The authors leverage Region-to-Image Distillation to address the persistent challenge in multimodal large language models (MLLMs) of detecting fine-grained visual details—like tiny text or subtle attributes—where global context drowns out critical micro-evidence. Prior methods, such as “Thinking-with-Images,” rely on iterative, tool-based zooming during inference, which improves accuracy but introduces high latency due to repeated visual re-encoding and tool calls. The authors’ key contribution is reframing zooming as a training-time operation: they generate high-quality VQA data from micro-cropped regions using strong teacher models, then distill that region-grounded supervision back into smaller student models trained on full images, enabling single-pass fine-grained perception at inference without tool use. They also introduce ZoomBench, a hybrid-annotated benchmark with a dual-view protocol to quantify the global-regional “zooming gap,” and show their models outperform larger MLLMs and agentic baselines while maintaining low latency.
Dataset

- The authors construct ZoomBench using a Region-to-Image Distillation method: a powerful MLLM (Gemini-2.5-Pro) generates questions and candidate answers from cropped micro-regions, then maps them to full images to form spatially ungrounded but evidence-backed QA pairs.
- ZoomBench contains 845 high-quality, diverse, and challenging QA pairs drawn from high-resolution images sourced across multiple public datasets (detailed in appendices), with no overlap between training and benchmark splits to prevent data leakage.
- Each instance includes a full image and a small-ratio cropped region (typically <10% of the image area) that serves as visual evidence; this dual-view setup enables evaluation of "zooming" ability and attention-based interpretation (a minimal sketch of the protocol follows this list).
- Human verification is applied: three PhD-level annotators validate each QA pair for clarity, answerability, and correctness against both full and cropped views; ~1,960 raw samples are filtered down to 845, removing overly easy or ambiguous cases.
- The benchmark covers six fine-grained perception dimensions: Fine-Grained Counting, OCR, Color Attributes, Structural Attributes, Material Attributes, and Object Identification.
- Evaluation includes 224 open-ended questions with canonical answers and 621 multiple-choice questions; scoring follows a hybrid protocol detailed in Appendix 9.3.
- For training data, the authors introduce explicit visual grounding (bounding boxes) to resolve ambiguity observed in cropped views, unlike the benchmark which intentionally omits spatial grounding to test model robustness.
- Core generation rules require image-based, concise, factual answers; encourage diversity in question types (counting, OCR, structure, material, etc.); and reject low-quality images by returning empty lists.
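Below is a minimal sketch of the dual-view protocol: the same model is scored once on the full image (global view) and once on the cropped evidence region (regional view), and the accuracy difference is reported as the zooming gap. The data structure and function names (`ZoomBenchItem`, `answer_question`, `is_correct`) are illustrative placeholders, not the released evaluation code, and the grader callable stands in for the hybrid open-ended/multiple-choice scoring described above.

```python
# Sketch of the dual-view "zooming gap" protocol; all names are illustrative.
from dataclasses import dataclass
from typing import Callable, List
from PIL import Image


@dataclass
class ZoomBenchItem:
    full_image: Image.Image   # global view
    crop_image: Image.Image   # evidence region (typically <10% of image area)
    question: str
    answer: str


def view_accuracy(items: List[ZoomBenchItem],
                  answer_question: Callable[[Image.Image, str], str],
                  is_correct: Callable[[str, str], bool],
                  view: str) -> float:
    """Score the model on either the full image ("global") or the crop ("regional")."""
    hits = 0
    for item in items:
        image = item.full_image if view == "global" else item.crop_image
        prediction = answer_question(image, item.question)
        hits += int(is_correct(prediction, item.answer))
    return hits / len(items)


def zooming_gap(items, answer_question, is_correct) -> float:
    """Regional-view accuracy minus global-view accuracy: a large positive gap
    means the model can read the evidence once it is cropped out for it,
    but misses it at a single glance over the full image."""
    return (view_accuracy(items, answer_question, is_correct, "regional")
            - view_accuracy(items, answer_question, is_correct, "global"))
```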
Method
The authors leverage Region-to-Image Distillation (R2I) to synthesize high-veracity, fine-grained VQA training data from unlabeled image corpora, enabling single-pass inference without test-time tool use. The core idea is to distill regional-view expertise from strong teacher models into a student model’s global-view predictions, thereby internalizing the benefits of “zooming in” during training while preserving inference efficiency.
The pipeline begins with object-centric region proposal. Given a raw image I, an object recognition and segmentation system generates candidate bounding boxes {B_1, …, B_n}, each covering at least one visible object. To target fine-grained perception, only micro-regions R_i satisfying Area(B_i)/Area(I) < τ (e.g., τ = 0.1) are retained, ensuring the decisive visual evidence is sparse and easily overlooked in the global view. For each such region R, a teacher model generates perception-centric questions Q_R that are strictly answerable from R alone, focusing on subtle cues like tiny text, symbols, or small-instance counts.
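For concreteness, a minimal sketch of this area-ratio filter is shown below; the (x1, y1, x2, y2) box format and the helper name are assumptions.

```python
# Keep only boxes whose area ratio to the full image is below tau
# (tau = 0.1 in the paper's example). Box format (x1, y1, x2, y2) is assumed.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def micro_regions(boxes: List[Box], image_w: int, image_h: int,
                  tau: float = 0.1) -> List[Box]:
    image_area = image_w * image_h
    kept = []
    for x1, y1, x2, y2 in boxes:
        box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if box_area / image_area < tau:  # decisive evidence must be small
            kept.append((x1, y1, x2, y2))
    return kept
```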
To ensure label veracity without manual annotation, multiple teacher models independently answer each question Q ∈ Q_R on the cropped region R. The authors employ majority voting across teacher responses; only triplets (R, Q, A) with high consensus (e.g., >6/8 agreement) are retained, substantially reducing hallucinated or invalid samples.
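The consensus filter can be sketched as follows, assuming eight teacher answers per question and simple string normalization; the authors' actual normalization and agreement criteria may be stricter.

```python
# Consensus filtering sketch: keep a (region, question, answer) triplet only
# when the majority answer reaches the agreement threshold (e.g., >6 of 8).
from collections import Counter
from typing import List, Optional


def consensus_answer(teacher_answers: List[str],
                     min_agree: int = 7) -> Optional[str]:
    """Return the majority answer if at least `min_agree` teachers agree
    (after light normalization), otherwise signal that the sample is dropped."""
    normalized = [a.strip().lower() for a in teacher_answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count >= min_agree else None
```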
Refer to the framework diagram: the distillation phase maps these region-level QA pairs back to the full image. A grounding transformation G(I,Q,B) overlays the bounding box B onto the original image I to form I′, and appends a spatial constraint to Q to form Q′. This yields an augmented training triplet (I′,Q′,A), where I′ and Q′ jointly anchor the question to the intended micro-region, resolving referential ambiguity that arises when the question is viewed in the global context. The authors further filter the synthetic dataset using a smaller multimodal model to remove overly easy samples, producing the final distilled dataset Dsyn.
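A minimal sketch of such a grounding transformation G(I, Q, B) is given below; the red-box overlay style and the appended spatial phrasing are assumptions rather than the authors' exact prompts.

```python
# Grounding transformation sketch: draw the evidence box on the full image
# and append a spatial constraint to the question, yielding (I', Q').
from PIL import Image, ImageDraw


def ground(image: Image.Image, question: str,
           box: tuple[float, float, float, float]):
    grounded = image.copy()
    ImageDraw.Draw(grounded).rectangle(box, outline="red", width=4)
    grounded_question = (f"{question} Answer based on the region inside the "
                         f"red bounding box at {tuple(round(v) for v in box)}.")
    return grounded, grounded_question  # paired with the teacher answer A
```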

The student model is then trained to maximize the expected task reward over this synthetic data:
$$\max_{\theta}\;\mathbb{E}_{(I',\,Q',\,A)\sim \mathcal{D}_{\mathrm{syn}},\;\hat{A}\sim \pi_{\theta}(\cdot\mid I',Q')}\big[\,r(\hat{A},A)\,\big],$$

where π_θ is the student policy and r(Â, A) is a task-specific reward that compares the sampled answer Â with the reference answer A. During inference, the bounding box is removed, but the model retains the ability to attend to the critical micro-region due to the structural hint provided during training. This aligns with the privileged information paradigm: the model learns P(A | I, Q, B) during training and generalizes to P(A | I, Q) at test time.
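As an illustration, the reward r(Â, A) could be as simple as a normalized exact-match check, a common choice for verifiable VQA answers; the paper's exact reward design is not reproduced here.

```python
# Illustrative task reward r(A_hat, A): exact match after light normalization.
# A placeholder for the reward in the objective above, not the authors' design.
import re


def reward(prediction: str, reference: str) -> float:
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(reference) else 0.0
```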
As shown in the figure below, the overall architecture contrasts with prior "Thinking with Images" methods that require iterative tool calls at inference. R2I decouples the tool-use phase (zoom-in synthesis) from inference, enabling direct, single-pass reasoning on the full image. The authors formalize this as a tool-action distillation framework: a tool-call action f(·) (e.g., zoom-in) transforms the original image I into an altered observation Ĩ = f(I), from which a teacher synthesizes a QA pair (Q̃, A); an inverse transformation f⁻¹ maps (Ĩ, Q̃) back to a full-image pair (I, Q), yielding a distilled dataset that trains the student to solve the task directly from the full image.
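This generic framework can be sketched as below, with f, f_inv, and synthesize_qa as abstract placeholders: for zoom-in, f crops a micro-region and f_inv re-anchors the question to the full image (e.g., via the box overlay above); other tool actions such as flipping or expert model calls would swap in their own f and f_inv.

```python
# Generic tool-action distillation: the tool f is applied only during data
# synthesis; an inverse mapping f_inv re-anchors the synthesized QA pairs to
# the raw observation, so the trained student never calls the tool at inference.
# All names here are abstract placeholders, not the authors' API.
def distill_tool_action(images, f, f_inv, synthesize_qa):
    dataset = []
    for image in images:
        altered = f(image)  # e.g., zoom-in crop of a micro-region
        for question, answer in synthesize_qa(altered):
            full_image, full_question = f_inv(image, altered, question)
            dataset.append((full_image, full_question, answer))
    return dataset
```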

The authors instantiate this framework using Qwen3-VL-235B for region proposal and question generation, and Qwen3-VL-235B and GLM-4.5V as answer generators. They curate a high-resolution image pool from SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, and STPLS3D, and synthesize 74K training samples after consensus and difficulty filtering. The method is generalizable to other tool actions such as flipping, 3D grounding, or expert model calls, as the core distillation mechanism remains agnostic to the specific tool used during synthesis.
Experiment
- Region-to-Image Distillation enables models to internalize zooming expertise, achieving fine-grained perception in a single forward pass without iterative tool use.
- ZwZ variants consistently outperform baseline Qwen-VL models across general, specific, and OOD benchmarks, including surpassing larger open-source models and rivaling closed-source SOTA in accuracy.
- Training on distilled synthetic data proves more effective than using larger public datasets or proxy-task synthetic data, highlighting the value of fine-grained, high-quality supervision over data volume.
- The method narrows the “zooming gap” between global and regional view performance, particularly improving on structure, material, and counting tasks where attention dilution is common.
- Visual grounding via bounding boxes overlaid on images during training significantly enhances attention localization, leading to better real-world generalization without requiring boxes at test time.
- ZwZ models outperform agentic and tool-use baselines while being substantially faster, demonstrating that inference-time zooming benefits can be internalized into model weights.
- Attention map analysis confirms ZwZ models concentrate more relevant visual attention on key regions, aligning with improved perception and reduced perceptual oversight.
- The approach validates that “information-neutral” image actions (like zooming) can be effectively distilled into models, while “information-gain” actions (like web search) remain essential for external interaction.
The authors use Region-to-Image Distillation to train compact vision-language models that achieve fine-grained perception in a single forward pass, without relying on inference-time zooming tools. Results show these models consistently outperform larger open-source baselines and match or exceed closed-source models on diverse benchmarks, while also demonstrating superior attention localization and generalization to real-world tasks. This approach effectively internalizes the benefits of tool-based zooming into model weights, achieving higher accuracy with significantly lower inference latency.

The authors use Region-to-Image Distillation to train compact vision-language models on synthetic data, enabling them to achieve fine-grained perception from full images in a single forward pass. Results show consistent improvements over baseline models and outperform larger open-source systems across general, specific, and out-of-distribution benchmarks, even surpassing agentic models that rely on iterative zooming. The method also demonstrates superior data efficiency and internalizes tool-use benefits while maintaining significantly lower inference latency.

The authors use Region-to-Image Distillation to train models that internalize fine-grained perception without requiring test-time zooming. Results show that ZwZ variants consistently achieve higher attention coverage on key image regions compared to their Qwen-VL baselines, indicating improved localization of task-relevant visual evidence. This enhanced focus directly contributes to narrowing the performance gap between global and regional views in fine-grained tasks.

The authors use Region-to-Image Distillation with bounding box overlays on images to train models that internalize fine-grained perception without requiring inference-time zooming. Results show that this approach significantly outperforms direct synthesis and alternative grounding strategies, narrowing the performance gap between global and regional views while improving attention focus on task-relevant regions. The method proves more effective than training on larger public datasets or proxy-task synthetic data, achieving strong generalization across diverse benchmarks with minimal data.

The authors use Region-to-Image Distillation to train compact vision-language models that achieve fine-grained perception in a single forward pass, without relying on inference-time zooming tools. Results show these models consistently outperform both larger open-source baselines and agentic systems across multiple benchmarks, while maintaining significantly lower inference latency. The method effectively internalizes the benefits of region-focused reasoning, narrowing the performance gap between global and regional views without requiring test-time tool calls.
