Method Details


Details for method 'SSMA'

 

Method overview

name SSMA
challenge pixel-level semantic labeling
details Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed SSMA fusion mechanism, which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. Extensive experimental evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance in addition to providing exceptional robustness in adverse perceptual conditions. Please refer to https://arxiv.org/abs/1808.03833 for details. A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
publication Self-Supervised Model Adaptation for Multimodal Semantic Segmentation
Abhinav Valada, Rohit Mohan, Wolfram Burgard
arXiv
https://arxiv.org/abs/1808.03833
project page / code http://deepscene.cs.uni-freiburg.de
used Cityscapes data fine annotations, coarse annotations, stereo
used external data ImageNet
runtime n/a
subsampling no
submission date January 2019
previous submissions
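The overview describes the SSMA block as a self-supervised gating mechanism that recalibrates concatenated modality-specific features before fusion. The following is a minimal NumPy sketch of that idea, not the authors' implementation: it concatenates two feature maps, squeezes the channels through a bottleneck (the reduction ratio `eta` and all weights here are hypothetical), re-expands them, applies a sigmoid gate, and projects back to the original channel count.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)
    return np.einsum("oc,chw->ohw", w, x)

def ssma_fuse(feat_a, feat_b, eta=4):
    """Sketch of an SSMA-style fusion block (illustrative weights only)."""
    c, h, wd = feat_a.shape
    x = np.concatenate([feat_a, feat_b], axis=0)          # (2C, H, W)
    # Hypothetical random weights stand in for learned parameters.
    w_sq = rng.standard_normal((2 * c // eta, 2 * c)) * 0.01
    w_ex = rng.standard_normal((2 * c, 2 * c // eta)) * 0.01
    w_out = rng.standard_normal((c, 2 * c)) * 0.01
    z = np.maximum(conv1x1(x, w_sq), 0.0)                 # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-conv1x1(z, w_ex)))        # sigmoid attention
    return conv1x1(x * gate, w_out)                       # fused (C, H, W)

feat_rgb = rng.standard_normal((8, 4, 4))
feat_depth = rng.standard_normal((8, 4, 4))
fused = ssma_fuse(feat_rgb, feat_depth)
print(fused.shape)  # (8, 4, 4)
```

The gate lets the network weight each channel and spatial location of the concatenated representation before the final projection, rather than fusing both modalities uniformly.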

 

Average results

Metric Value
IoU Classes 82.312
iIoU Classes 62.2501
IoU Categories 91.5078
iIoU Categories 81.7139
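For reference, the IoU reported above is the standard intersection-over-union, TP / (TP + FP + FN), computed per class over all pixels; iIoU additionally weights each pixel's contribution by the ratio of the class's average instance size to the size of the instance it belongs to, which is why it is only reported for instance classes (humans and vehicles) and marked "-" elsewhere. A minimal sketch of the per-class IoU computation on toy label maps:

```python
import numpy as np

def class_iou(pred, gt, cls):
    # IoU = TP / (TP + FP + FN) for a single class label
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    return tp / (tp + fp + fn)

# Toy 3x3 label maps (values are class ids), not real benchmark data.
gt   = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
pred = np.array([[0, 1, 1], [1, 1, 2], [2, 2, 0]])
print(class_iou(pred, gt, 1))  # 0.75 (3 TP, 1 FP, 0 FN)
```

The class-averaged score ("IoU Classes") is then the mean of this quantity over all 19 evaluation classes.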

 

Class results

Class IoU iIoU
road 98.6664 -
sidewalk 86.884 -
building 93.605 -
wall 57.8519 -
fence 63.4302 -
pole 68.938 -
traffic light 77.1464 -
traffic sign 81.1373 -
vegetation 93.8571 -
terrain 73.0615 -
sky 95.3172 -
person 87.4316 72.6122
rider 73.7845 52.3686
car 96.3584 91.4028
truck 81.1375 47.834
bus 93.4868 58.08
train 89.9538 58.6083
motorcycle 73.5405 51.6583
bicycle 78.3401 65.437

 

Category results

Category IoU iIoU
flat 98.7282 -
nature 93.4833 -
object 75.2108 -
sky 95.3172 -
construction 93.9055 -
human 87.8295 73.6083
vehicle 96.0798 89.8195

 
