MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko Joonmyung Choi Hyeong Kyu Choi Kyoung-Woon On Byungseok Roh Hyunwoo J. Kim

Abstract
Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, VIOLET and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
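
To make the bi-level formulation and the AID step more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The model is reduced to a single scalar `w`, and the loss-combining transformer is swapped for one learnable weight `phi` on a single auxiliary loss, so the combination here is linear rather than non-linear. All function names, targets (`a`, `b`, `t`), and step sizes are assumptions chosen for illustration. The inner problem minimizes the combined training loss over `w`; the outer problem updates `phi` with a hypergradient from the implicit function theorem, where the inverse Hessian is approximated by a truncated Neumann series.

```python
# Toy bi-level problem (all names and constants are hypothetical):
#   inner:  w*(phi) = argmin_w (w - a)^2 + phi * (w - b)^2   (primary + weighted auxiliary loss)
#   outer:  min_phi (w*(phi) - t)^2                          (target-task loss on held-out data)

def inner_grad(w, phi, a, b):
    # d/dw of the combined training loss
    return 2.0 * (w - a) + 2.0 * phi * (w - b)

def inner_hessian(phi):
    # d^2/dw^2 of the combined training loss (a scalar in this toy)
    return 2.0 + 2.0 * phi

def mixed_partial(w, b):
    # d/dphi of inner_grad
    return 2.0 * (w - b)

def outer_grad(w, t):
    # d/dw of the outer (target-task) loss
    return 2.0 * (w - t)

def neumann_inverse(h, alpha, k_terms):
    # Approximates 1/h by alpha * sum_{k=0}^{K-1} (1 - alpha*h)^k,
    # which converges when |1 - alpha*h| < 1.
    total, term = 0.0, 1.0
    for _ in range(k_terms):
        total += term
        term *= 1.0 - alpha * h
    return alpha * total

def hypergradient(w, phi, b, t, alpha=0.1, k_terms=50):
    # Implicit function theorem: dL_outer/dphi = -dL_outer/dw * H^{-1} * d^2L_inner/(dw dphi)
    v = outer_grad(w, t) * neumann_inverse(inner_hessian(phi), alpha, k_terms)
    return -v * mixed_partial(w, b)

def train(a=0.0, b=2.0, t=1.5, outer_steps=500, inner_steps=20,
          lr_w=0.1, lr_phi=0.5):
    w, phi = 0.0, 0.1
    for _ in range(outer_steps):
        # Inner loop: approximately solve for w at the current phi.
        for _ in range(inner_steps):
            w -= lr_w * inner_grad(w, phi, a, b)
        # Outer step: update the loss weight with the approximate hypergradient.
        phi -= lr_phi * hypergradient(w, phi, b, t)
        phi = max(phi, 0.0)  # keep the auxiliary weight non-negative
    return w, phi

if __name__ == "__main__":
    w, phi = train()
    print(f"w = {w:.4f}, phi = {phi:.4f}")
```

In this toy the inner optimum has the closed form w*(phi) = (a + phi·b) / (1 + phi), so with the default targets the outer loss is minimized at phi = 3, where w* = 1.5; the loop should approach those values. In a real model the Hessian is a large matrix, and the Neumann series would be evaluated with Hessian-vector products rather than an explicit inverse.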
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| multimodal-sentiment-analysis-on-cmu-mosi | UniVL + MELTR | Acc-2: 85.3, Corr: 0.789, F1: 85.4, MAE: 0.759 |
| video-captioning-on-msr-vtt-1 | UniVL + MELTR | BLEU-4: 44.17, CIDEr: 52.77, METEOR: 29.26, ROUGE-L: 62.35 |
| video-captioning-on-youcook2 | UniVL + MELTR | BLEU-3: 24.12, BLEU-4: 17.92, CIDEr: 1.90, METEOR: 22.56, ROUGE-L: 47.04 |
| video-retrieval-on-msr-vtt | All-in-one + MELTR | text-to-video R@1: 38.6, R@5: 74.4, R@10: 84.7 |
| video-retrieval-on-msr-vtt | VIOLET + MELTR | text-to-video R@1: 33.6, R@5: 63.7, R@10: 77.8, Median Rank: 3 |
| video-retrieval-on-msr-vtt | UniVL + MELTR | text-to-video R@1: 28.5, R@5: 55.5, R@10: 67.6, Median Rank: 4 |
| video-retrieval-on-msr-vtt-1ka | UniVL + MELTR | text-to-video R@1: 31.1, R@5: 55.7, R@10: 68.3, Median Rank: 4 |
| video-retrieval-on-msr-vtt-1ka | All-in-one + MELTR | text-to-video R@1: 41.3, R@5: 73.5, R@10: 82.5 |
| video-retrieval-on-msr-vtt-1ka | VIOLET + MELTR | text-to-video R@1: 35.5, R@5: 67.2, R@10: 78.4, Median Rank: 3 |
| video-retrieval-on-youcook2 | UniVL + MELTR | text-to-video R@1: 33.7, R@5: 63.1, R@10: 74.8, Median Rank: 3 |
| visual-question-answering-on-msvd-qa-1 | VIOLET + MELTR | Accuracy: 0.517 |