Yuan Gong, Member, IEEE, Alexander H. Liu, Andrew Rouditchenko, and James Glass, Fellow, IEEE

Abstract
Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find that UAVM has several intriguing properties that its modality-independent counterparts do not have.
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| audio-classification-on-audioset | UAVM (Audio + Video) | Test mAP: 0.504 |
| audio-classification-on-vggsound | UAVM (Audio + Video) | Top-1 Accuracy: 65.8% |
| audio-classification-on-vggsound | UAVM (Audio Only) | Top-1 Accuracy: 56.5% |
| audio-classification-on-vggsound | UAVM (Video Only) | Top-1 Accuracy: 49.9% |
| multi-modal-classification-on-audioset | UAVM | Average mAP: 0.504 |
| multi-modal-classification-on-vgg-sound | UAVM | Top-1 Accuracy: 65.8% |