Benchmark Suite


We offer a benchmark suite together with an evaluation server, so that authors can upload their results and obtain a ranking for each task (pixel-level and instance-level semantic labeling). The evaluation is designed so that a single algorithm can contribute to multiple challenges. To submit your results, please register, log in, and follow the instructions on our submission page.

 

Pixel-Level Semantic Labeling Task

The first Cityscapes task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information.

Metrics

To assess performance, we rely on the standard Jaccard index, commonly known as the PASCAL VOC intersection-over-union metric, IoU = TP / (TP + FP + FN) [1], where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, accumulated over the whole test set. Because the labels come at two semantic granularities, classes and categories, we report two separate mean performance scores: IoU category and IoU class. In both cases, pixels labeled as void do not contribute to the score.
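The IoU definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation server's code: the function names, the flat-list image representation, and the void id 255 are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

VOID_LABEL = 255  # assumed id for void pixels in this sketch


def accumulate_counts(
    gt_images: Sequence[Sequence[int]],
    pred_images: Sequence[Sequence[int]],
    num_labels: int,
    void_label: int = VOID_LABEL,
) -> Tuple[List[int], List[int], List[int]]:
    """Accumulate per-label TP, FP, and FN pixel counts over all images."""
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for gt, pred in zip(gt_images, pred_images):
        for g, p in zip(gt, pred):
            if g == void_label:
                continue  # void ground truth pixels do not contribute
            if g == p:
                tp[g] += 1
            else:
                fn[g] += 1  # the true label was missed
                if 0 <= p < num_labels:
                    fp[p] += 1  # the predicted label was wrong here
    return tp, fp, fn


def iou_per_label(
    tp: Sequence[int], fp: Sequence[int], fn: Sequence[int]
) -> List[Optional[float]]:
    """IoU = TP / (TP + FP + FN) per label; None if the label never occurs."""
    scores: List[Optional[float]] = []
    for t, f_pos, f_neg in zip(tp, fp, fn):
        denom = t + f_pos + f_neg
        scores.append(t / denom if denom > 0 else None)
    return scores


def mean_iou(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean over all labels that have a defined score."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None


# Tiny example: one image with five pixels, two labels, one void pixel.
gt = [[0, 0, 1, 1, 255]]
pred = [[0, 1, 1, 1, 0]]
tp, fp, fn = accumulate_counts(gt, pred, num_labels=2)
scores = iou_per_label(tp, fp, fn)  # label 0: 1 / (1 + 0 + 1) = 0.5
```

One way to obtain the category-level score with the same functions would be to map every class id to its category id in both the ground truth and the prediction before counting. That mapping step is an assumption of this sketch.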

It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes, with their strong scale variation, this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we additionally evaluate the semantic labeling using an instance-level intersection-over-union metric, iIoU = iTP / (iTP + FP + iFN). Again, iTP, FP, and iFN denote the numbers of true positive, false positive, and false negative pixels, respectively. However, in contrast to the standard IoU measure, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class’s average instance size to the size of the respective ground truth instance. Note that, unlike in the instance-level task below, we assume that methods only yield a standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and do not require normalization. The final scores, iIoU category and iIoU class, are obtained as the means for the two semantic granularities.
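The weighting used for iIoU can be illustrated with a small Python sketch for a single class. It is a hedged example rather than the official evaluation code: the function name, the flat-list inputs, the void id 255, and the choice to pass the class's average instance size in as a parameter are all assumptions, since the text does not say how that average is obtained.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple

# One image: (ground truth labels, ground truth instance ids, predicted labels)
Image = Tuple[Sequence[int], Sequence[int], Sequence[int]]


def iiou_for_class(
    images: Iterable[Image],
    class_id: int,
    avg_instance_size: float,
    void_label: int = 255,  # assumed void id for this sketch
) -> Optional[float]:
    """iIoU = iTP / (iTP + FP + iFN) for one class, accumulated over images."""
    i_tp = 0.0
    i_fn = 0.0
    fp = 0
    for gt_labels, gt_instances, pred_labels in images:
        # Pixel count of each ground truth instance of this class in the image.
        sizes = Counter(
            inst
            for lab, inst in zip(gt_labels, gt_instances)
            if lab == class_id
        )
        for lab, inst, pred in zip(gt_labels, gt_instances, pred_labels):
            if lab == void_label:
                continue  # void pixels do not contribute
            if lab == class_id:
                # Weight: average instance size / size of this instance.
                weight = avg_instance_size / sizes[inst]
                if pred == class_id:
                    i_tp += weight
                else:
                    i_fn += weight
            elif pred == class_id:
                fp += 1  # false positives belong to no instance: unweighted
    denom = i_tp + fp + i_fn
    return i_tp / denom if denom > 0 else None


# Example: one large instance (id 7, 4 pixels) and one small (id 8, 1 pixel).
image = (
    [1, 1, 1, 1, 1, 0, 0, 255],  # ground truth labels
    [7, 7, 7, 7, 8, 0, 0, 0],    # ground truth instance ids
    [1, 1, 0, 0, 1, 1, 0, 1],    # predicted labels
)
score = iiou_for_class([image], class_id=1, avg_instance_size=2.5)
```

In this example the plain IoU for class 1 is 3 / (3 + 1 + 2) = 0.5, while the weighted score is 0.625, because the fully recovered small instance counts as much as the half-recovered large one.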

Results

Detailed results

Detailed results, including performance on individual classes and categories, can be found here.


name | fine | coarse | 16-bit | depth | video | sub | IoU class | iIoU class | IoU category | iIoU category | Runtime [s] | code | title | authors | venue | description
FCN 8syesyesnononononononononono65.341.785.770.10.5noyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
RRR-ResNet152-MultiScaleyesyesyesyesnononononononono75.848.589.374.0n/anonoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
Dilation10yesyesnononononononononono67.142.086.571.14.0noyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
Adelaideyesyesnononononononononono66.446.782.867.435.0nonoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono2264.834.981.358.74.0noyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
DeepLab LargeFOV Strongyesyesnononononononono2263.134.581.258.74.0noyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
DPNyesyesyesyesnononononono3359.128.179.557.9n/anonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
Segnet basicyesyesnononononononono4457.032.079.161.90.06noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
Segnet extendedyesyesnononononononono4456.134.279.866.40.06noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
CRFasRNNyesyesnononononononono2262.534.482.766.00.7noyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
Scale invariant CNN + CRFyesyesnonononoyesyesnononono66.344.985.071.2n/anoyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
DPNyesyesnononononononononono66.839.186.069.1n/anonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
Pixel-level Encoding for Instance Segmentationyesyesnonononoyesyesnononono64.341.685.973.9n/anonoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
Adelaide_contextyesyesnononononononononono71.651.787.374.1n/anonoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
NVSegNetyesyesnononononononononono67.441.487.268.10.4nonoAnonymousIn the inference, we use the image of 2 different scales. The same for training!
ENetyesyesnononononononono2258.334.480.464.00.013noyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
DeepLabv2-CRFyesyesnononononononononono70.442.686.467.7n/anoyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
m-TCFsyesyesyesyesnononononononono71.843.687.670.61.0nonoAnonymousConvolutional Neural Network
DeepLab+DynamicCRFyesyesnononononononononono64.538.383.762.4n/anonoru.nl
LRR-4xyesyesnononononononononono69.748.088.274.7n/anoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
LRR-4xyesyesyesyesnononononononono71.847.988.473.9n/anoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
Le_Selfdriving_VGGyesyesnononononononononono65.935.684.464.3n/anonoAnonymous
SQyesyesnononononononononono59.832.384.366.00.06nonoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
SAITyesyesyesyesnononononononono76.951.889.675.54.0nonoAnonymousAnonymous
FoveaNetyesyesnononononononononono74.152.489.377.6n/anonoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
RefineNetyesyesnononononononononono73.647.287.970.6n/anoyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
SegModelyesyesnononononononononono78.556.189.875.90.8nonoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
TuSimpleyesyesnononononononononono77.653.690.175.2n/anoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
Global-Local-Refinementyesyesnononononononononono77.353.490.076.8n/anonoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
XPARSEyesyesnononononononononono73.449.288.774.2n/anonoAnonymous
ResNet-38yesyesnononononononononono78.459.190.981.1n/anoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
SegModelyesyesyesyesnononononononono79.256.490.477.0n/anonoAnonymous
Deep Layer Cascade (LC)yesyesnononononononononono71.147.088.174.1n/anonoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
FRRNyesyesnononononononono2271.845.588.975.1n/anoyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
MNet_MPRGyesyesnononononononononono71.946.689.377.90.6nonoChubu University, MPRGwithout val dataset, external dataset (e.g. image net) and post-processing
ResNet-38yesyesyesyesnononononononono80.657.891.079.1n/anoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
FCN8s-QunjieYuyesyesnononononononononono57.434.581.868.7n/anonoAnonymous
RGB-D FCNyesyesyesyesnonoyesyesnononono67.442.187.571.0n/anonoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
MultiBoostyesyesyesyesnonoyesyesnono2259.332.581.960.20.25nonoAnonymousBoosting based solution.
Publication is under review.
GoogLeNet FCNyesyesnononononononononono63.038.685.869.8n/anonoGoing Deeper with ConvolutionsChristian Szegedy , Wei Liu , Yangqing Jia , Pierre Sermanet , Scott Reed , Dragomir Anguelov , Dumitru Erhan , Vincent Vanhoucke , Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
ERFNet (pretrained)yesyesnononononononono2269.744.187.372.70.02noyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


ERFNet (from scratch)yesyesnononononononono2268.040.486.570.40.02noyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels
TuSimple_Coarseyesyesyesyesnononononononono80.156.990.777.8n/anoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the
PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
SAC-multipleyesyesnononononononononono78.155.290.678.3n/anonoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
NetWarpyesyesyesyesnonononoyesyesnono80.559.591.079.8n/anonoAnonymous
depthAwareSeg_RNN_ffyesyesnononononononononono78.256.089.776.9n/anoyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
Ladder DenseNetyesyesnononononononononono74.351.689.779.50.45noyesLadder-style DenseNets for Semantic Segmentation of Large Natural ImagesIvan Krešo, Josip Krapac, Siniša ŠegvićICCV 2017https://ivankreso.github.io/publication/ladder-densenet/
Real-time FCNyesyesyesyesnononononononono72.645.587.971.60.044nonoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
GridNetyesyesnononononononononono69.544.187.971.1n/anonoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
PEARLyesyesnonononononoyesyesnono75.451.689.275.1n/anonoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnononononononono67.342.186.568.30.69noyesAnonymous
PSPNetyesyesyesyesnononononononono81.259.691.279.2n/anoyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
motovisyesyesyesyesnononononononono81.357.791.580.7n/anonomotovis.com
ML-CRNNyesyesnononononononononono71.247.187.772.5n/anonoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
Hybrid Modelyesyesnononononononononono65.841.285.268.5n/anonoAnonymous
tek-Iflyyesyesnononononononononono81.160.190.979.6n/anonoIflytekIflytek-yinusing a fusion strategy of three single models, the best result of a single model is 80.01%,multi-scale
GridNetyesyesnononononononononono69.844.588.171.4n/anoyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing.
firenetyesyesnononononononono2268.247.884.975.5n/anonoAnonymous
DeepLabv3yesyesyesyesnononononononono81.362.191.681.7n/anonoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
EdgeSenseSegyesyesnononononononononono76.857.189.878.5n/anonoAnonymousDeep segmentation network with hard negative mining and other tricks.
iFLYTEK-CVyesyesyesyesnononononononono81.460.991.079.5n/anonoIFLYTEK RESEARCHIFLYTEK CV Group - YinLinBoth fine(train&val) and coarse data were used to train a novel segmentation framework.
ScaleNetyesyesyesyesnononononononono75.153.189.676.8n/anonoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krarß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation. Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling field size accordingly. Our model is compact and can be extended easily to other research domains. Finally, the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
K-netyesyesnononononononononono76.052.888.875.4n/anonoXinLiang Zhong
MSNETyesyesnononononononononono76.857.190.681.60.2nonoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
Multitask Learningyesyesnononononononononono78.557.489.977.7n/anonoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
DeepMotionyesyesnononononononononono81.458.690.778.1n/anonoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
SR-AICyesyesyesyesnononononononono81.960.791.379.6n/anonoAnonymous
Roadstar.ai_CV(SFNet)yesyesnononononononononono79.260.891.082.60.2nonoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chensame focus net (SFNet), based on fine labels only, with a focus on the loss distribution and the same focus on every layer of the feature maps
DFNyesyesyesyesnononononononono80.358.390.879.6n/anonoLearning a Discriminative Feature Network for Semantic SegmentationChangqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong SangarxivMost existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency problem, we specially design a Smooth Network with Channel Attention Block and global average pooling to select the more discriminative features. Furthermore, we propose a Border Network to make the bilateral features of boundary distinguishable with deep semantic boundary supervision. Based on our proposed DFN, we achieve state-of-the-art performance 86.2% mean IOU on PASCAL VOC 2012 and 80.3% mean IOU on Cityscapes dataset.
RelationNet_Coarseyesyesyesyesnononononononono82.461.991.881.4n/anonoRelationNet: Learning Deep-Aligned Representation for Semantic Image SegmentationYueqing ZhuangICPR Semantic image segmentation, which assigns labels in pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning. However, one central problem of these methods is that deep convolution neural network gives little consideration to the correlation among pixels. To handle this issue, in this paper, we propose a novel deep neural network named RelationNet, which utilizes CNN and RNN to aggregate context information. Besides, a spatial correlation loss is applied to supervise RelationNet to align features of spatial pixels belonging to same category. Importantly, since it is expensive to obtain pixel-wise annotations, we exploit a new training method for combining the coarsely and finely labeled data. Separate experiments show the detailed improvements of each proposal. Experimental results demonstrate the effectiveness of our proposed method to the problem of semantic image segmentation.
ARSAITyesyesnononononononononono73.648.289.074.81.0nonoAnonymousanonymous
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnononononononono82.065.991.281.7n/anoyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
EFBNETyesyesnononononononononono81.859.990.778.8n/anonoAnonymous
Ladder DenseNet v2yesyesnononononononononono78.454.690.878.71.0nonoJournal submissionAnonymousDenseNet-121 model used in downsampling path with ladder-style skip connections upsampling path on top of it.
ESPNetyesyesnononononononono2260.331.882.263.10.0089noyesESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh HajishirziWe introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
ENet with the Lovász-Softmax lossyesyesnononononononono2263.134.183.661.00.013noyesThe Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networksMaxim Berman, Amal Rannen Triki, Matthew B. BlaschkoarxivThe Lovász-Softmax loss is a novel surrogate for optimizing the IoU measure in neural networks. Here we finetune the weights provided by the authors of ENet (arXiv:1606.02147) with this loss, for 10'000 iterations on training dataset. The runtimes are unchanged with respect to the ENet architecture.
DRN_CRL_Coarseyesyesyesyesnononononononono82.861.191.880.7n/anoyesDense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image SegmentationYueqing ZhuangICIPDRN_CoarseSemantic image segmentation, which aims at assigning a category to every pixel, is one of the challenging image understanding problems. Global context plays an important role in local pixel-wise category assignment. To make the best of global context, in this paper, we propose a dense relation network (DRN) and a context-restricted loss (CRL) to aggregate global and local information. DRN uses a Recurrent Neural Network (RNN) with different skip lengths in spatial directions to get context-aware representations, while CRL helps aggregate them to learn consistency. Compared with previous methods, our proposed method takes full advantage of hierarchical contextual representations to produce high-quality results. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Cityscapes and Pascal Context benchmarks, with mean IoU of 82.8% and 49.0%, respectively.
ShuffleSegyesyesyesyesnononononononono58.332.480.262.2n/anonoShuffleSeg: Real-time Semantic Segmentation NetworkMostafa Gamal, Mennatullah Siam, Mo'men Abdel-RazekUnder Review by ICIP 2018ShuffleSeg: An efficient realtime semantic segmentation network with skip connections and ShuffleNet units
SkipNet-MobileNetyesyesyesyesnononononononono61.535.282.063.0n/anonoRTSeg: Real-time Semantic Segmentation FrameworkMennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin JagersandUnder Review by ICIP 2018An efficient realtime semantic segmentation network with skip connections based on MobileNet.

Kronecker Convolution Networksyesyesnononononononononono78.956.990.777.51.0nonoAnonymousWe propose novel Kronecker convolution networks for semantic image segmentation.
ThunderNetyesyesnononononononono2264.040.484.169.30.0104nonoAnonymous
SU_Netnononononononononononono75.352.388.575.0n/anonoAnonymous
MobileNetV2Plusyesyesnononononononononono70.746.887.672.9n/anonoHuijun LiuMobileNetV2Plus
DeepLabv3+yesyesyesyesnononononononono82.162.492.081.9n/anoyesEncoder-Decoder with Atrous Separable Convolution for Semantic Image SegmentationLiang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig AdamarXivSpatial pyramid pooling modules or encoder-decoder structures are used in deep neural networks for semantic segmentation tasks. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We will provide more details in the coming update on the arXiv report.
RFMobileNetV2Plusyesyesnononononononononono70.749.488.375.8n/anonoHuijun LiuReceptive Field MobileNetV2Plus for Semantic Segmentation
GoogLeNetV1_ROByesyesnononononononononono59.635.183.064.4n/anonoAnonymousGoogLeNet-v1 FCN trained on Cityscapes, KITTI, and ScanNet, as required by the Robust Vision Challenge at CVPR'18 (http://robustvision.net/)
SAITv2yesyesyesyesnononononononono70.036.584.562.10.025nonoAnonymous
GUNetyesyesnononononononono2270.440.886.869.10.03nonoGuided Upsampling Network for Real-Time Semantic SegmentationDavide MazziniarxivGuided Upsampling Network for Real-Time Semantic Segmentation
TKCNetyesyesnononononononononono78.956.990.777.51.0nonoTree-structured Kronecker Convolutional Networks for Semantic SegmentationTianyi Wu, Sheng Tang, Rui Zhang Linghui Li, Yongdong ZhangMost existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect important local contextual information. To tackle this issue, we firstly propose a novel Kronecker convolution which adopts Kronecker product to expand its kernel for taking into account the feature vectors neglected by atrous convolutions. Therefore, it can capture local contextual information and enlarge the field of view of filters simultaneously without introducing extra parameters. Secondly, we propose Tree-structured Feature Aggregation (TFA) module which follows a recursive rule to expand and forms a hierarchical structure. Thus, it can naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design Tree-structured Kronecker Convolutional Networks (TKCN) that employs Kronecker convolution and TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the e
RMNetyesyesnononononononononono64.537.384.667.70.014nonoAnonymousA fast and light net for semantic segmentation.
ContextNetyesyesnononononononononono66.136.882.864.30.0238nonoContextNet: Exploring Context and Detail for Semantic Segmentation in Real-timeRudra PK Poudel, Ujwal Bonde, Stephan Liwicki, Christopher ZacharXivModern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024x2048) resolution.
DPCyesyesyesyesnononononononono82.763.392.082.5n/anoyesSearching for Efficient Multi-Scale Architectures for Dense Image PredictionLiang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon ShlensNIPS 2018In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that achieve state-of-the-art performance. Additionally, the resulting architecture (called DPC for Dense Prediction Cell) is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.
NV-ADLRyesyesyesyesnononononononono83.264.292.182.2n/anonoAnonymousNVIDIA Applied Deep Learning Research
Adaptive Affinity Field on PSPNetyesyesnononononononononono79.156.190.878.5n/anoyesAdaptive Affinity Field for Semantic SegmentationTsung-Wei Ke*, Jyh-Jing Hwang*, Ziwei Liu, Stella X. YuECCV 2018Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether neighbouring pixels are in the same segment should match between the prediction and the ground-truth. The affinity fields characterize geometric relationships within the image, such as "motorcycles have round wheels". We further develop a novel method for learning the optimal neighbourhood size for each semantic category, with an adversarial loss that optimizes over worst-case scenarios. Unlike the common Conditional Random Field (CRF) approaches, our adaptive affinity field (AAF) method has no extra parameters during inference, and is less sensitive to appearance changes in the image.
APMoE_seg_ROByesyesnononononononononono56.530.683.566.10.9noyesPixel-wise Attentional Gating for Parsimonious Pixel LabelingShu Kong, Charless FowlkesarxivThe Pixel-level Attentional Gating (PAG) unit is trained to choose, for each pixel, the pooling size used to aggregate the contextual region around it. There are multiple branches with different dilation rates for varied pooling sizes, and thus varying receptive fields. For this ROB challenge, PAG is expected to robustly aggregate information for the final prediction.

This is our entry for the Robust Vision Challenge 2018 workshop (ROB). The model is based on ResNet50, trained on a mixed dataset of Cityscapes, ScanNet and KITTI.
BatMAN_ROByesyesyesyesnononononononono55.429.383.965.01.0nonoAnonymousbatch-normalized multistage attention network
HiSS_ROByesyesnononononononono2258.932.181.460.50.06nonoAnonymous
VENUS_ROByesyesnononononononononono66.437.184.566.7n/anonoAnonymousVENUS_ROB
VlocNet++_ROBnononononononononononono62.733.983.460.8n/anonoAnonymous
AHiSS_ROByesyesyesyesnononononono2270.639.884.262.90.06nonoAnonymousAugmented Hierarchical Semantic Segmentation
IBN-PSP-SA_ROByesyesnononononononononono75.146.389.172.0n/anonoAnonymousIBN-PSP-SA_ROB
LDN2_ROByesyesnononononononononono77.952.390.177.11.0nonoAnonymousLadder DenseNet: https://ivankreso.github.io/publication/ladder-densenet/
MiniNetyesyesnononononononono4440.715.870.544.80.004nonoAnonymous
AdapNetv2_ROByesyesnononononononononono63.834.984.362.4n/anonoAnonymous
MapillaryAI_ROByesyesnononononononononono80.560.191.180.2n/anonoAnonymous
FCN101_ROByesyesnononononononononono30.411.361.138.5n/anonoAnonymous
MaskRCNN_BOSHyesyesnononononononononono73.948.587.270.0n/anonoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]Bosch autodrive challenge
EnsembleModel_Boschyesyesnononononononononono74.448.988.572.9n/anonoJin shengtao, Yi zhihao, Liu wei [Our team name was MaskRCNN_BOSH, firefly]We ensembled three models (ERFNet, DeepLab-MobileNet, TuSimple) and gained a 0.57 improvement in the IoU class score. The best single model scores 73.8549.
EVANetyesyesnononononononononono69.844.087.773.1n/anonoAnonymous
CLRCNetyesyesnononononononononono63.335.984.468.00.013nonoCLRCNet: Cascaded Low-Rank Convolutions for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method.
Edgenetyesyesnononononononono2271.046.688.575.00.03nonoAnonymousA lightweight semantic segmentation network combined with edge information and channel-wise attention mechanism.
L2-SPyesyesyesyesnononononononono81.258.191.078.5n/anoyesExplicit Inductive Bias for Transfer Learning with Convolutional NetworksXuhong Li, Yves Grandvalet, Franck DavoineICML-2018With a simple variant of weight decay, L2-SP regularization (see the paper for details), we reproduced PSPNet based on the original ResNet-101 using "train_fine + val_fine + train_extra" set (2975 + 500 + 20000 images), with a small batch size 8. The sync batch normalization layer is implemented in Tensorflow (see the code).
ALV303yesyesnononononononononono72.252.089.879.20.2nonoAnonymous
NCTU-ITRIyesyesnononononononono2269.141.486.870.80.0147nonoAnonymousFor the purpose of fast semantic segmentation, we design a CNN-based encoder-decoder architecture, which is called DSNet. The encoder part is constructed based on the concept of DenseNet, and a simple decoder is adopted to make the network more efficient without degrading the accuracy. We pre-train the encoder network on the ImageNet dataset. Then, only the fine-annotated Cityscapes dataset (2975 training images) is used to train the complete DSNet. The DSNet demonstrates a good trade-off between accuracy and speed. It can process 68 frames per second on 1024x512 resolution images on a single GTX 1080 Ti GPU.
ADSCNetyesyesnononononononononono64.536.884.968.70.013nonoADSCNet: Asymmetric Depthwise Separable Convolution for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method for mobile devices.
SRC-B-MachineLearningLabyesyesyesyesnononononononono82.560.791.881.5n/anonoJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoSamsung Research Center MachineLearningLab. The result is obtained with multi-scale and flip testing. The paper is in preparation.
Tencent AI Labyesyesyesyesnononononononono82.963.991.880.4n/anonoAnonymous
ERINetyesyesnononononononono2269.844.187.473.40.023nonoAnonymousEfficient residual inception networks for real-time semantic segmentation
PGCNet_Res101_fineyesyesnononononononononono80.560.791.581.1n/anonoAnonymousWe choose ResNet101 pretrained on ImageNet as our backbone and train our model on both the train-fine and the val-fine data with batch size 8 for 80k iterations, without any bells and whistles. We will release our paper later.
EDANetyesyesnononononononono2267.341.885.869.90.0092noyesEfficient Dense Modules of Asymmetric Convolution for Real-Time Semantic SegmentationShao-Yuan Lo (NCTU), Hsueh-Ming Hang (NCTU), Sheng-Wei Chan (ITRI), Jing-Jhih Lin (ITRI)Training data: Fine annotations only (train+val. set, 2975+500 images) without any pretraining nor coarse annotations.
For training on fine annotations (train set only, 2975 images), it attains a mIoU of 66.3%.

Runtime: (resolution 512x1024) 0.0092s on a single GTX 1080Ti, 0.0123s on a single Titan X.
OCNet_ResNet101_fineyesyesnononononononononono81.261.391.681.1n/anonoAnonymousContext is essential for various computer vision tasks.
The state-of-the-art scene parsing methods define the context as the prior of the scene categories (e.g., bathroom, bedroom, street).
Such scene context is not suitable for the street scene parsing tasks as most of the scenes are similar.

In this work, we propose the Object Context that captures the prior of the object's category that the pixel belongs to.
We compute the object context by aggregating all the pixels' features according to an attention map that encodes, for each pixel, the probability that it belongs to the same category as the associated pixel.
Specifically, we employ the self-attention method to compute the pixel-wise attention map.

We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to handle the problem of multi-scales.
Knowledge-Awareyesyesnononononononononono79.355.690.778.0n/anonoAnonymousKnowledge-Aware Semantic Segmentation
LDFNetyesyesnonononoyesyesnono2271.346.388.574.2n/anoyesIncorporating Luminance, Depth and Color Information by Fusion-based Networks for Semantic SegmentationShang-Wei Hung, Shao-Yuan LoWe propose a preferred solution, which incorporates Luminance, Depth and color information by a Fusion-based network named LDFNet. It includes a distinctive encoder sub-network to process the depth maps and further employs the luminance images to assist the depth information in a process. LDFNet achieves very competitive results compared to the other state-of-art systems on the challenging Cityscapes dataset, while it maintains an inference speed faster than most of the existing top-performing networks. The experimental results show the effectiveness of the proposed information-fused approach and the potential of LDFNet for road scene understanding tasks.
CGNetyesyesnononononononononono64.835.985.767.50.02noyesTianyi Wu et al.We propose a novel Context Guided Network for semantic segmentation on mobile devices. We first design a Context Guided (CG) block by considering the inherent characteristics of semantic segmentation. The CG block aggregates local features, surrounding context features and global context features effectively and efficiently. Based on the CG block, we develop the Context Guided Network (CGNet), which not only has a strong capacity for localization and recognition, but also has a low computational and memory footprint. Under a similar number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, the proposed approach achieves 64.8% mean IoU on the Cityscapes test set with less than 0.5 M parameters, and has a frame rate of 50 fps on one NVIDIA Tesla K80 card for 2048 × 1024 high-resolution images.
SAITv2-lightyesyesyesyesnononononononono73.044.087.469.40.025nonoAnonymous
iFLYTEK-CVyesyesyesyesnononononononono83.664.792.182.3n/anonoAnonymousiFLYTEK Research, CV Group
SwiftNetRN-18yesyesnononononononononono75.552.089.877.20.0243nonoAnonymous

 

Instance-Level Semantic Labeling Task

In the second Cityscapes task we focus on simultaneously detecting objects and segmenting them. This is an extension to both traditional object detection, since per-instance segments must be provided, and pixel-level semantic labeling, since each instance is treated as a separate label. Therefore, algorithms are required to deliver a set of detections of traffic participants in the scene, each associated with a confidence score and a per-instance segmentation mask.

Metrics

To assess instance-level performance, we compute the average precision on the region level (AP [2]) for each class and average it across a range of overlap thresholds to avoid a bias towards a specific value. Specifically, we follow [3] and use 10 different overlaps ranging from 0.5 to 0.95 in steps of 0.05. The overlap is computed at the region level, making it equivalent to the IoU of a single instance. We penalize multiple predictions of the same ground truth instance as false positives. To obtain a single, easy to compare compound score, we report the mean average precision AP, obtained by also averaging over the class label set. As minor scores, we add AP50% for an overlap value of 50 %, as well as AP100m and AP50m where the evaluation is restricted to objects within 100 m and 50 m distance, respectively.
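
The averaging described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the official evaluation script: the names `mask_iou`, `average_precision`, and `mean_ap` are hypothetical, all instances are assumed to come from a single image, a match requires an overlap strictly greater than the threshold, and the area under the precision-recall curve uses a standard all-point interpolation. Ignore regions and the distance-restricted variants (AP100m, AP50m) are left out.

```python
import numpy as np

# Ten overlap thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS = np.linspace(0.5, 0.95, 10)


def mask_iou(a, b):
    """Region-level overlap (IoU) of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def average_precision(preds, gts, thr):
    """AP of one class at one overlap threshold.

    preds: list of (confidence, mask) pairs; gts: non-empty list of masks.
    """
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = [False] * len(gts)
    tp = np.zeros(len(preds))
    for i, (_, mask) in enumerate(preds):
        ious = [mask_iou(mask, g) for g in gts]
        j = int(np.argmax(ious))
        # A repeated prediction of an already matched instance stays a false positive.
        if ious[j] > thr and not matched[j]:
            matched[j] = True
            tp[i] = 1.0
    ctp = np.cumsum(tp)
    recall = ctp / len(gts)
    precision = ctp / np.arange(1, len(preds) + 1)
    # Area under the precision-recall curve with a monotone precision envelope.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for k in range(len(mpre) - 2, -1, -1):
        mpre[k] = max(mpre[k], mpre[k + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


def mean_ap(preds_by_class, gts_by_class):
    """Average AP over the overlap thresholds, then over the classes."""
    per_class = []
    for cls, gts in gts_by_class.items():
        if not gts:
            continue  # assumption: classes without ground truth are skipped
        preds = preds_by_class.get(cls, [])
        per_class.append(np.mean([average_precision(preds, gts, t) for t in THRESHOLDS]))
    return float(np.mean(per_class))
```

In this sketch, AP50% corresponds to calling `average_precision` with `thr=0.5` only and averaging over classes.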

Results

Detailed results

Detailed results including performances regarding individual classes and categories can be found here.

name | fine | coarse | 16-bit | depth | video | sub | AP | AP 50% | AP 100m | AP 50m | Runtime [s] | code | title | authors | venue | description
R-CNN + MCG convex hullyesyesnononononononono224.612.97.710.360.0nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
Pixel-level Encoding for Instance Segmentationyesyesnonononoyesyesnononono8.921.115.316.7n/anonoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono222.33.73.94.90.2nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
Boundary-aware Instance Segmentationyesyesnononononononono2217.436.729.334.0n/anonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
RecAttendyesyesnononononononono449.518.916.820.9n/anonoAnonymous
Joint Graph Decomposition and Node Labelingyesyesnononononononono889.823.216.820.3n/anonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
InstanceCutyesyesyesyesnononononononono13.027.922.126.1n/anonoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono2217.535.927.831.0n/anoyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
SGNyesyesyesyesnononononononono25.044.938.944.5n/anonoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
Mask R-CNN [COCO]yesyesnononononononononono32.058.145.849.5n/anonoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
Mask R-CNN [fine-only]yesyesnononononononononono26.249.937.640.1n/anonoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
Deep Watershed Transformationyesyesnononononononono2219.435.331.436.8n/anonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
Foveal Vision for Instance Segmentation of Road Imagesyesyesnonononoyesyesnononono12.525.220.422.1n/anonoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
SegNetyesyesyesyesnononononononono29.555.643.245.80.5nonoAnonymous
PANet [fine-only]yesyesnononononononononono31.857.144.246.0n/anoyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only, training hyper-parameters are adopted from Mask R-CNN.
PANet [COCO]yesyesnononononononononono36.463.149.251.8n/anoyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only + COCO, training hyper-parameters are adopted from Mask R-CNN.
LCISyesyesnononononononononono15.130.824.225.8n/anonoAnonymous
Pixelwise Instance Segmentation with a Dynamically Instantiated Networkyesyesyesyesnononononononono23.445.236.840.9n/anonoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H. S. TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label (this has recently been termed "Panoptic Segmentation"). Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).

PolygonRNN++yesyesnononononononononono25.545.539.343.4n/anoyesEfficient Annotation of Segmentation Datasets with Polygon-RNN++D. Acuna, H. Ling, A. Kar, and S. FidlerCVPR 2018
GMIS: Graph Merge for Instance Segmentationyesyesyesyesnononononononono27.644.642.747.9n/anonoYiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
TCnetyesyesnononononononononono32.659.045.047.8n/anonoAnonymousTCnet
MaskRCNN_ROByesyesnononononononononono10.225.214.614.6n/anonoAnonymousMaskRCNN Instance segmentation baseline for ROB challenge using default parameters from Matterport's implementation of Mask RCNN
https://github.com/matterport/Mask_RCNN
Multitask Learningyesyesnononononononononono21.639.035.037.0n/anoyesMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaCVPR 2018Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
Deep Coloringyesyesnononononononononono24.946.239.044.0n/anonoAnonymousAnonymous ECCV submission #2955
MRCNN_VSCMLab_ROByesyesnononononononononono14.829.524.829.31.0nonoAnonymousMaskRCNN+FPN with pre-trained COCO model.
multi-scale training with short edge in [800, 1024]
inference with short edge size 800
ScanNet is randomly subsampled to a size close to that of Cityscapes

optimizer: Adam
learning rate: linear warm-up from 1e-4 to 1e-3, then decreased by a factor of 0.1 at epochs 200 and 300.

epoch: 400
step per epoch: 500
roi_per_im: 512
BAMRCNN_ROByesyesnononononononononono0.30.90.20.1n/anonoAnonymous
NL_ROI_ROByesyesnononononononononono24.045.836.140.8n/anonoAnonymousNon-local ROI on Mask R-CNN
RUSH_ROByesyesnononononononononono32.155.545.246.3n/anonoAnonymous
MaskRCNN_BOSHyesyesnononononononononono12.828.022.126.7n/anonoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]MaskRCNN segmentation baseline for the Bosch autodrive challenge,
using Matterport's implementation of Mask RCNN https://github.com/matterport/Mask_RCNN
55k iterations, default parameters (backbone: ResNet-101)
19 hours of training
NV-ADLRyesyesnononononononononono35.361.549.353.5n/anonoAnonymousNVIDIA Applied Deep Learning Research
Sogou_MMyesyesnononononononononono37.264.551.154.5n/anonoGlobal Concatenating Feature Enhancement for Instance SegmentationHang Yang, Xiaozhe Xin, Wenwen Yang, Bin LiGlobal Concatenating Feature Enhancement for Instance Segmentation
iFLYTEK-CVyesyesnononononononononono38.065.451.655.0n/anonoAnonymousiFLYTEK Research, CV Group

 

Meta Information

In addition to the previously introduced measures, we report additional meta information for each method, such as timings or the kind of information each algorithm is using, e.g. depth data or multiple video frames. Please refer to the result tables for further details.

 

References

[1] M. Everingham, A. S. M. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes challenge: A retrospective,” IJCV, vol. 111, iss. 1, 2014.
[2] B. Hariharan, P. Arbeláez, R. B. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in ECCV, 2014.
[3] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. C. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014.