Speech Prompted Semantic Segmentation On

Metrics

mAP

mIoU
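
For readers unfamiliar with the second metric, mIoU (mean intersection over union) can be sketched as below. This is a minimal illustration, not the benchmark's evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions, and the protocol used for the scores on this page may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class labels, one per pixel.
    (Hypothetical helper for illustration only.)
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)
print(score)
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Leaderboards such as this one typically report the value multiplied by 100.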

Results

Performance results of various models on this benchmark

Model	mAP	mIoU	Paper Title
DenseAV	48.7	36.8	Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
DAVENet	32.2	26.3	Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
CAVMAE	27.2	19.9	Contrastive Audio-Visual Masked Autoencoder
ImageBind	20.2	19.7	ImageBind: One Embedding Space To Bind Them All
