When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model
and Benchmark Dataset
Yi Zhang Wang Zeng Sheng Jin Chen Qian Ping Luo Wentao Liu

Abstract
Recent years have witnessed increasing research attention towards pedestrian detection by taking advantage of different sensor modalities (e.g. RGB, IR, Depth, LiDAR and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multimodal perception. Unlike previous specialist models that only process one or a pair of specific modality inputs, MMPedestron is able to process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, i.e. MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for specific sensor modalities. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves comparable performance to the InternImage-H model on CrowdHuman with 30x fewer parameters. Code and data are available at https://github.com/BubblyYi/MMPedestron.
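To make the idea of processing "dynamic combinations" of modalities with extra fusion tokens more concrete, here is a minimal, purely illustrative sketch. It is not the MMPedestron implementation: the abstract does not specify how MAA and MAF are built, so the single-head attention with identity projections, the `fuse` helper, the token counts, and the random vectors standing in for learned tokens are all assumptions made for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (illustration only)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def fuse(modality_tokens, extra_tokens):
    """Hypothetical helper: prepend the extra tokens to whichever modalities
    are present, attend over the joint sequence, and split the result."""
    sequence = list(extra_tokens)
    for name in sorted(modality_tokens):
        sequence.extend(modality_tokens[name])
    attended = self_attention(sequence)
    n_extra = len(extra_tokens)
    return attended[:n_extra], attended[n_extra:]


rng = random.Random(0)
DIM = 8


def rand_tokens(n):
    """Random vectors standing in for learned or encoded tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


extra = rand_tokens(2)  # stand-ins for the two extra learnable tokens

# The same code path handles one modality or several.
fused_a, patches_a = fuse({"rgb": rand_tokens(4)}, extra)
fused_b, patches_b = fuse({"rgb": rand_tokens(4), "ir": rand_tokens(4)}, extra)
```

The point of the sketch is only that a token-based encoder needs no architectural change when the set of input modalities varies: the sequence simply grows or shrinks, and the extra tokens attend over whatever is present.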
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| multispectral-object-detection-on-flir-1 | MMPedestron | mAP50: 86.4% |
| object-detection-on-crowdhuman-full-body | MMPedestron | AP: 97.1, mMR: 30.8 |
| object-detection-on-eventped | MMPedestron | AP: 79.0 |
| object-detection-on-inoutdoor | MMPedestron | AP: 65.7 |
| object-detection-on-stcrowd | MMPedestron | AP: 74.9 |
| pedestrian-detection-on-llvip | MMPedestron | AP: 72.6 |
| pedestrian-detection-on-mmpd-dataset | MMPedestron | box mAP: 79.0 |