Qwen-VL: A Versatile Vision-Language Model for Understanding,
Localization, Text Reading, and Beyond
Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou Jingren Zhou

Abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
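As a rough illustration of what aligning image-caption-box tuples can look like in practice, the sketch below turns a caption phrase and a pixel-space bounding box into a single text string that a language model can be trained on. The abstract does not spell out the format; the `<ref>`/`<box>` tags and the normalization of coordinates to the range [0, 1000) reflect our reading of the full paper and should be treated as assumptions here. The names `GroundedCaption`, `normalize_box`, and `serialize` are hypothetical helpers, not part of the released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GroundedCaption:
    """One caption phrase paired with its box in pixel coordinates."""
    phrase: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def normalize_box(box, width, height, bins=1000):
    """Map pixel coordinates to integer bins in [0, bins).

    Assumption: coordinates are quantized relative to image size,
    so the text form is independent of the image resolution.
    """
    def scale(value, size):
        # Multiply before dividing, then clamp into the valid bin range.
        return min(bins - 1, max(0, int(value * bins / size)))

    x1, y1, x2, y2 = box
    return (scale(x1, width), scale(y1, height),
            scale(x2, width), scale(y2, height))


def serialize(item, width, height):
    """Render a phrase and its box as one tagged string (assumed format)."""
    x1, y1, x2, y2 = normalize_box(item.box, width, height)
    return f"<ref>{item.phrase}</ref><box>({x1},{y1}),({x2},{y2})</box>"


example = GroundedCaption("a dog", (64, 48, 320, 240))
print(serialize(example, width=640, height=480))
```

Writing boxes as plain text means grounding needs no separate detection head: the same next-token objective covers captions, answers, and coordinates.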
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| chart-question-answering-on-chartqa | Qwen-VL | 1:1 Accuracy: 65.7 |
| chart-question-answering-on-chartqa | Qwen-VL-Chat | 1:1 Accuracy: 66.3 |
| fs-mevqa-on-sme | Qwen-VL-Max | #Learning Samples (N): 16 ACC: 40.33 BLEU-4: 24.30 CIDEr: 201.47 Detection: 1.05 METEOR: 23.40 ROUGE-L: 34.52 SPICE: 26.13 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Max | Total Column Score: 366 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Plus | Total Column Score: 310 |
| natural-language-visual-grounding-on | Qwen-VL | Accuracy (%): 5.2 |
| spatial-reasoning-on-embspatial-bench | Qwen-VL-Max | Generation: 49.11 |
| visual-question-answering-on-docvqa-test | Qwen-VL | ANLS: 0.651 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Plus | ANLS: 0.9024 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Chat | ANLS: 0.626 |
| visual-question-answering-on-mm-vet | Qwen-VL-Max | GPT-4 score: 66.6±0.5 |
| visual-question-answering-on-mm-vet | Qwen-VL-Plus | GPT-4 score: 61.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | Qwen-VL-Max | GPT-4 score: 55.8±0.2 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox): 45.3 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox): 39.2 GPT-4 score (human): 41.7 |
| visual-question-answering-vqa-on-core-mm | Qwen-VL-Chat | Abductive: 44.39 Analogical: 30.42 Deductive: 37.55 Overall score: 37.39 Params: 16B |