Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong

Abstract
In this paper, we focus on designing an effective method for fast and accurate scene parsing. A common practice for improving performance is to obtain high-resolution feature maps with strong semantic representation. Two widely used strategies, atrous convolutions and feature pyramid fusion, are either computationally intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure yields superior performance over other real-time methods, even on lightweight backbone networks such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K, and CamVid. Notably, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code is available at https://github.com/lxtGH/SFSegNets.
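To make the flow-based alignment idea concrete, the following is a minimal sketch of warping a coarse feature map onto a finer grid with a per-pixel offset field. It is not the authors' implementation: the flow is passed in rather than learned, a single channel is used, bilinear sampling with corner-aligned coordinates is assumed, and the names `bilinear_sample` and `flow_warp` are hypothetical.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2D grid at fractional (y, x); coordinates are clamped to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def flow_warp(coarse, flow, out_h, out_w):
    """Warp `coarse` onto an out_h x out_w grid.

    flow[i][j] is a (dy, dx) offset, in coarse-grid pixel units, added to the
    position that output pixel (i, j) would take under plain upsampling.
    Assumption: in a learned module this field would be predicted by a network.
    """
    in_h, in_w = len(coarse), len(coarse[0])
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Corner-aligned mapping from the fine grid to the coarse grid.
            base_y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
            base_x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            dy, dx = flow[i][j]
            row.append(bilinear_sample(coarse, base_y + dy, base_x + dx))
        out.append(row)
    return out

coarse = [[0.0, 1.0], [2.0, 3.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(3)]
upsampled = flow_warp(coarse, zero_flow, 3, 3)  # zero flow = plain bilinear upsampling
```

With an all-zero flow the warp reduces to ordinary bilinear upsampling; a non-zero offset at a pixel shifts where that pixel reads from in the coarse map, which is the degree of freedom a learned flow field would exploit.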
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| real-time-semantic-segmentation-on-cityscapes | SFNet-R18 | FPS: 25.7 (1080Ti); Time: 39.2 ms; mIoU: 80.4% |
| semantic-segmentation-on-bdd100k-val | SFNet(ResNet-18) | mIoU: 60.6 (132.5 FPS, 4090) |
| semantic-segmentation-on-bdd100k-val | SFNet(DF1) | mIoU: 55.4 (70.3 FPS) |
| semantic-segmentation-on-bdd100k-val | SFNet(DF2) | mIoU: 60.2 (208 FPS, 4090) |
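
For readers unfamiliar with the mIoU metric reported above, here is a minimal sketch of how mean intersection-over-union can be computed from flattened label lists. The function name `mean_iou` is hypothetical, and the `ignore_index=255` default is an assumption borrowed from common segmentation practice, not something stated on this page; benchmark evaluations usually accumulate counts over an entire dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count for or against any class
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.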