Long-VITA: A Multimodal Understanding Demo With Millions of Tokens

Date

4 months ago

1. Tutorial Introduction

Long-VITA is a long-context large multimodal model released in February 2025 by Tencent YouTu Lab, Nanjing University, and Xiamen University. It extends the context length to 1 million tokens while maintaining leading accuracy on short contexts, enabling efficient processing of multimodal inputs such as text and images. The accompanying paper is titled "Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy".

This tutorial deploys the Long-VITA-16K_HF model on a single RTX 4090 graphics card.

2. Effect Examples

Text Conversation

Image Understanding

Video Understanding

3. Operation Steps

1. After starting the container, click the API address to open the Gradio interactive interface

2. Once the webpage loads, you can use the model

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 2-3 minutes and then refresh the page.
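
The wait-and-refresh step can also be scripted. The following is a minimal sketch, not part of the original tutorial: it polls the API address until it stops answering with an error such as 502 Bad Gateway. The function names, the 15-second interval, and the 300-second limit are illustrative assumptions; pass in the API address shown for your own container.

```python
import time
import urllib.error
import urllib.request


def probe_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses (for example 502 while the model is loading) land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_ready(url, probe=probe_status, interval=15.0, max_wait=300.0,
                     sleep=time.sleep):
    """Poll url until it answers with HTTP 200.

    Returns True once the page is reachable, or False if max_wait seconds
    of sleeping have passed without a 200 response. The interval and
    max_wait defaults are assumptions, chosen to cover the 2-3 minute
    initialization mentioned above with some margin.
    """
    waited = 0.0
    while True:
        if probe(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example call (replace the placeholder with your container's API address):
# ready = wait_until_ready("<your API address>")
```

The `probe` and `sleep` parameters are injectable only so the loop can be exercised without a network connection; in normal use the defaults are enough.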

Precautions

For long-context inputs, make sure enough GPU memory (VRAM) is available; very large texts are best loaded in batches.
Keep each side of an input image at ≤ 2048 pixels to reduce inference latency.
If inference fails, check the input format or shorten the input and try again.
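
The first two precautions can be prepared for before anything is submitted to the interface. The sketch below is illustrative only: `fit_within` computes target dimensions that keep both sides within 2048 pixels while preserving the aspect ratio, and `split_text` cuts a long text into fixed-size pieces. Both function names are hypothetical, and splitting by character count is an assumption used as a rough stand-in for token count, which the tutorial does not specify.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled down so neither side exceeds max_side.

    The aspect ratio is preserved. Images already within the limit are
    returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def split_text(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters.

    Character count is only a crude proxy for token count; pick max_chars
    conservatively for your input.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# Example: a 4096 x 2048 image would be resized to 2048 x 1024.
print(fit_within(4096, 2048))
# Example: a 7-character string in pieces of at most 3 characters.
print(split_text("abcdefg", 3))
```

The computed dimensions can be passed to whatever image tool you already use for resizing; the helper itself performs no image I/O.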

Citation Information

The citation information for this project is as follows:

@misc{shen2025longvitascalinglargemultimodal,
      title={Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy}, 
      author={Yunhang Shen and Chaoyou Fu and Shaoqi Dong and Xiong Wang and Yi-Fan Zhang and Peixian Chen and Mengdan Zhang and Haoyu Cao and Ke Li and Xiawu Zheng and Yan Zhang and Yiyi Zhou and Ran He and Caifeng Shan and Rongrong Ji and Xing Sun},
      year={2025},
      eprint={2502.05177},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.05177}, 
}

This notebook is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.
