Detailed Results


This page lists the performance of every method on every metric, broken down by class and category. Please refer to the Benchmark Suite for details on the evaluation and metrics. Jump to the individual tables via the following links:

Pixel-Level Semantic Labeling Task

Instance-Level Semantic Labeling Task


Usage

Within each table, use the buttons in the first row to hide columns or to export the visible data to various formats. Use the widgets in the second row to filter the table by selecting values of interest (multiple selections possible). Click the numeric columns for sorting.


Pixel-Level Semantic Labeling Task


IoU on class-level

name | fine | coarse | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | average | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle
FCN 8s | yes | no | no | no | no | no | yes | Fully Convolutional Networks for Semantic Segmentation | J. Long, E. Shelhamer, and T. Darrell | CVPR 2015 | Trained by Marius Cordts on a pre-release version of the dataset | 0.5 | 65.3 | 97.4 | 78.4 | 89.2 | 34.9 | 44.2 | 47.4 | 60.1 | 65.0 | 91.4 | 69.3 | 93.9 | 77.1 | 51.4 | 92.6 | 35.3 | 48.6 | 46.5 | 51.6 | 66.8
RRR-ResNet152-MultiScale | yes | yes | no | no | no | no | no |  | Anonymous |  | update: this submission actually used the coarse labels, which was previously not marked accordingly | n/a | 75.8 | 98.3 | 84.0 | 92.0 | 50.8 | 54.5 | 62.6 | 67.7 | 73.7 | 92.8 | 70.8 | 95.0 | 82.6 | 60.6 | 95.0 | 65.3 | 83.1 | 76.6 | 63.3 | 71.3
Dilation10 | yes | no | no | no | no | no | yes | Multi-Scale Context Aggregation by Dilated Convolutions | Fisher Yu and Vladlen Koltun | ICLR 2016 | Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val. | 4.0 | 67.1 | 97.6 | 79.2 | 89.9 | 37.3 | 47.6 | 53.2 | 58.6 | 65.2 | 91.8 | 69.4 | 93.7 | 78.9 | 55.0 | 93.3 | 45.5 | 53.4 | 47.7 | 52.2 | 66.0
Adelaide | yes | no | no | no | no | no | no | Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation | G. Lin, C. Shen, I. Reid, and A. van den Hengel | arXiv preprint 2015 | Trained on a pre-release version of the dataset | 35.0 | 66.4 | 97.3 | 78.5 | 88.4 | 44.5 | 48.3 | 34.1 | 55.5 | 61.7 | 90.1 | 69.5 | 92.2 | 72.5 | 52.3 | 91.0 | 54.6 | 61.6 | 51.6 | 55.0 | 63.1
DeepLab LargeFOV StrongWeak | yes | yes | no | no | no | 2 | yes | Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation | G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille | ICCV 2015 | Trained on a pre-release version of the dataset | 4.0 | 64.8 | 97.4 | 78.3 | 88.1 | 47.5 | 44.2 | 29.5 | 44.4 | 55.4 | 89.4 | 67.3 | 92.8 | 71.0 | 49.3 | 91.4 | 55.9 | 66.6 | 56.7 | 48.1 | 58.1
DeepLab LargeFOV Strong | yes | no | no | no | no | 2 | yes | Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs | L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille | ICLR 2015 | Trained on a pre-release version of the dataset | 4.0 | 63.1 | 97.3 | 77.7 | 87.7 | 43.6 | 40.5 | 29.7 | 44.5 | 55.4 | 89.4 | 67.0 | 92.7 | 71.2 | 49.4 | 91.4 | 48.7 | 56.7 | 49.1 | 47.9 | 58.6
DPN | yes | yes | no | no | no | 3 | no | Semantic Image Segmentation via Deep Parsing Network | Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang | ICCV 2015 | Trained on a pre-release version of the dataset | n/a | 59.1 | 96.3 | 71.7 | 86.7 | 43.7 | 31.7 | 29.2 | 35.8 | 47.4 | 88.4 | 63.1 | 93.9 | 64.7 | 38.7 | 88.8 | 48.0 | 56.4 | 49.4 | 38.3 | 50.0
Segnet basic | yes | no | no | no | no | 4 | yes | SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | V. Badrinarayanan, A. Kendall, and R. Cipolla | arXiv preprint 2015 | Trained on a pre-release version of the dataset | 0.06 | 57.0 | 96.4 | 73.2 | 84.0 | 28.5 | 29.0 | 35.7 | 39.8 | 45.2 | 87.0 | 63.8 | 91.8 | 62.8 | 42.8 | 89.3 | 38.1 | 43.1 | 44.2 | 35.8 | 51.9
Segnet extended | yes | no | no | no | no | 4 | yes | SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | V. Badrinarayanan, A. Kendall, and R. Cipolla | arXiv preprint 2015 | Trained on a pre-release version of the dataset | 0.06 | 56.1 | 95.6 | 70.1 | 82.8 | 29.9 | 31.9 | 38.1 | 43.1 | 44.6 | 87.3 | 62.3 | 91.7 | 67.3 | 50.7 | 87.9 | 21.7 | 29.0 | 34.7 | 40.5 | 56.6
CRFasRNN | yes | no | no | no | no | 2 | yes | Conditional Random Fields as Recurrent Neural Networks | S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr | ICCV 2015 | Trained on a pre-release version of the dataset | 0.7 | 62.5 | 96.3 | 73.9 | 88.2 | 47.6 | 41.3 | 35.2 | 49.5 | 59.7 | 90.6 | 66.1 | 93.5 | 70.4 | 34.7 | 90.1 | 39.2 | 57.5 | 55.4 | 43.9 | 54.6
Scale invariant CNN + CRF | yes | no | no | yes | no | no | yes | Convolutional Scale Invariance for Semantic Segmentation | I. Kreso, D. Causevic, J. Krapac, and S. Segvic | GCPR 2016 | We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance. | n/a | 66.3 | 96.3 | 76.8 | 88.8 | 40.0 | 45.4 | 50.1 | 63.3 | 69.6 | 90.6 | 67.1 | 92.2 | 77.6 | 55.9 | 90.1 | 39.2 | 51.3 | 44.4 | 54.4 | 66.1
DPN | yes | no | no | no | no | no | no | Semantic Image Segmentation via Deep Parsing Network | Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang | ICCV 2015 | DPN trained on full resolution images | n/a | 66.8 | 97.5 | 78.5 | 89.5 | 40.4 | 45.9 | 51.1 | 56.8 | 65.3 | 91.5 | 69.4 | 94.5 | 77.5 | 54.2 | 92.5 | 44.5 | 53.4 | 49.9 | 52.1 | 64.8
Pixel-level Encoding for Instance Segmentation | yes | no | no | yes | no | no | no | Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling | J. Uhrig, M. Cordts, U. Franke, and T. Brox | GCPR 2016 | We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances. | n/a | 64.3 | 97.4 | 77.7 | 88.8 | 27.7 | 40.1 | 51.5 | 60.1 | 64.7 | 91.1 | 67.6 | 93.5 | 77.7 | 54.2 | 92.4 | 33.7 | 42.0 | 42.5 | 52.5 | 66.5
Adelaide_context | yes | no | no | no | no | no | no | Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation | Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid | CVPR 2016 | We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set. | n/a | 71.6 | 98.0 | 82.6 | 90.6 | 44.0 | 50.7 | 51.1 | 65.0 | 71.7 | 92.0 | 72.0 | 94.1 | 81.5 | 61.1 | 94.3 | 61.1 | 65.1 | 53.8 | 61.6 | 70.6
NVSegNet | yes | no | no | no | no | no | no |  | Anonymous |  | For inference, we use the image at 2 different scales. The same holds for training. | 0.4 | 67.4 | 98.0 | 81.9 | 90.1 | 35.7 | 39.8 | 57.4 | 60.6 | 69.3 | 91.7 | 67.6 | 94.6 | 79.3 | 54.5 | 93.5 | 43.8 | 52.4 | 50.3 | 53.0 | 67.8
ENet | yes | no | no | no | no | 2 | yes | ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation | Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello |  |  | 0.013 | 58.3 | 96.3 | 74.2 | 85.0 | 32.2 | 33.2 | 43.5 | 34.1 | 44.0 | 88.6 | 61.4 | 90.6 | 65.5 | 38.4 | 90.6 | 36.9 | 50.5 | 48.1 | 38.8 | 55.4
DeepLabv2-CRF | yes | no | no | no | no | no | yes | DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs | Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille | arXiv preprint | DeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set. | n/a | 70.4 | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8
m-TCFs | yes | yes | no | no | no | no | no |  | Anonymous |  | Convolutional Neural Network | 1.0 | 71.8 | 98.2 | 83.6 | 91.2 | 48.4 | 53.2 | 55.8 | 64.3 | 70.3 | 92.2 | 70.2 | 94.5 | 79.9 | 59.2 | 94.1 | 56.0 | 69.1 | 58.2 | 56.7 | 68.4
DeepLab+DynamicCRF | yes | no | no | no | no | no | no |  | ru.nl |  |  | n/a | 64.5 | 96.8 | 77.9 | 88.9 | 37.5 | 45.4 | 39.1 | 51.5 | 61.6 | 90.8 | 58.0 | 93.6 | 76.6 | 53.8 | 92.6 | 41.8 | 52.5 | 50.5 | 53.2 | 64.2
LRR-4x | yes | no | no | no | no | no | yes | Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation | Golnaz Ghiasi, Charless C. Fowlkes | ECCV 2016 | We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x') | n/a | 69.7 | 97.7 | 79.9 | 90.7 | 44.4 | 48.6 | 58.6 | 68.2 | 72.0 | 92.5 | 69.3 | 94.7 | 81.6 | 60.0 | 94.0 | 43.6 | 56.8 | 47.2 | 54.8 | 69.7
LRR-4x | yes | yes | no | no | no | no | yes | Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation | Golnaz Ghiasi, Charless C. Fowlkes | ECCV 2016 | We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF. | n/a | 71.8 | 97.9 | 81.5 | 91.4 | 50.5 | 52.7 | 59.4 | 66.8 | 72.7 | 92.5 | 70.1 | 95.0 | 81.3 | 60.1 | 94.3 | 51.2 | 67.7 | 54.6 | 55.6 | 69.6
Le_Selfdriving_VGG | yes | no | no | no | no | no | no |  | Anonymous |  |  | n/a | 65.9 | 97.5 | 78.7 | 88.9 | 42.7 | 44.2 | 46.4 | 53.4 | 61.1 | 90.2 | 68.6 | 93.4 | 74.1 | 48.5 | 91.9 | 44.9 | 62.4 | 52.3 | 51.2 | 61.3
SQ | yes | no | no | no | no | no | no | Speeding up Semantic Segmentation for Autonomous Driving | Michael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp Hochreiter | NIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain |  | 0.06 | 59.8 | 96.9 | 75.4 | 87.9 | 31.6 | 35.7 | 50.9 | 52.0 | 61.7 | 90.9 | 65.8 | 93.0 | 73.8 | 42.6 | 91.5 | 18.8 | 41.2 | 33.3 | 34.0 | 59.9
SAIT | yes | yes | no | no | no | no | no |  | Anonymous |  | Anonymous | 4.0 | 76.9 | 98.5 | 85.8 | 92.6 | 59.2 | 56.6 | 62.4 | 69.4 | 75.3 | 93.2 | 72.1 | 95.2 | 83.6 | 66.1 | 95.2 | 68.6 | 80.9 | 73.0 | 61.7 | 72.4
FoveaNet | yes | no | no | no | no | no | no | FoveaNet | Xin Li, Jiashi Feng |  | 1. caffe-master; 2. resnet-101; 3. single scale testing. Previously listed as "LXFCRN". | n/a | 74.1 | 98.2 | 83.2 | 91.5 | 44.4 | 51.2 | 63.2 | 70.8 | 75.5 | 92.7 | 70.1 | 94.5 | 83.3 | 64.2 | 94.6 | 60.8 | 70.7 | 63.3 | 63.0 | 73.2
RefineNet | yes | no | no | no | no | no | yes | RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | Guosheng Lin; Anton Milan; Chunhua Shen; Ian Reid |  | Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet. 2975 images (training set with fine labels) are used for training. | n/a | 73.6 | 98.2 | 83.3 | 91.3 | 47.8 | 50.4 | 56.1 | 66.9 | 71.3 | 92.3 | 70.3 | 94.8 | 80.9 | 63.3 | 94.5 | 64.6 | 76.1 | 64.3 | 62.2 | 70.0
SegModel | yes | no | no | no | no | no | no |  | Anonymous |  | Both train set (2975) and val set (500) are used to train model for this submission. | 0.8 | 78.5 | 98.6 | 86.4 | 92.8 | 52.4 | 59.7 | 59.6 | 72.5 | 78.3 | 93.3 | 72.8 | 95.5 | 85.4 | 70.0 | 95.7 | 75.4 | 84.1 | 75.1 | 68.7 | 75.0
TuSimple | yes | no | no | no | no | no | yes | Understanding Convolution for Semantic Segmentation | Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell |  |  | n/a | 77.6 | 98.5 | 85.5 | 92.8 | 58.6 | 55.5 | 65.0 | 73.5 | 77.9 | 93.3 | 72.0 | 95.2 | 84.8 | 68.5 | 95.4 | 70.9 | 78.8 | 68.7 | 65.9 | 73.8
Global-Local-Refinement | yes | no | no | no | no | no | no | Global-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing Predictions | Rui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng Yan | International Joint Conference on Artificial Intelligence (IJCAI) 2017 | Global-residual and local-boundary refinement. The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated. | n/a | 77.3 | 98.6 | 86.1 | 92.8 | 57.0 | 58.3 | 63.3 | 70.8 | 76.8 | 93.4 | 72.2 | 95.4 | 84.9 | 67.9 | 95.6 | 68.5 | 77.5 | 69.4 | 65.2 | 74.5
XPARSE | yes | no | no | no | no | no | no |  | Anonymous |  |  | n/a | 73.4 | 98.3 | 83.9 | 91.6 | 47.6 | 53.4 | 59.5 | 66.8 | 72.5 | 92.7 | 70.9 | 95.2 | 82.4 | 63.5 | 94.7 | 57.4 | 68.8 | 62.2 | 62.6 | 71.5
ResNet-38 | yes | no | no | no | no | no | yes | Wider or Deeper: Revisiting the ResNet Model for Visual Recognition | Zifeng Wu, Chunhua Shen, Anton van den Hengel | arXiv | Single model, single scale, no post-processing with CRFs. Model A2, 2 conv., fine only, single scale testing. The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work. | n/a | 78.4 | 98.5 | 85.7 | 93.1 | 55.5 | 59.1 | 67.1 | 74.8 | 78.7 | 93.7 | 72.6 | 95.5 | 86.6 | 69.2 | 95.7 | 64.5 | 78.8 | 74.1 | 69.0 | 76.7
SegModel | yes | yes | no | no | no | no | no |  | Anonymous |  |  | n/a | 79.2 | 98.6 | 86.2 | 93.0 | 53.7 | 60.4 | 64.2 | 73.5 | 78.5 | 93.4 | 72.2 | 95.5 | 85.3 | 68.6 | 95.8 | 77.9 | 87.0 | 78.0 | 68.0 | 75.1
Deep Layer Cascade (LC) | yes | no | no | no | no | no | no | Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade | Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang | CVPR 2017 | We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and | n/a | 71.1 | 98.1 | 82.8 | 91.2 | 47.1 | 52.8 | 57.3 | 63.9 | 70.7 | 92.5 | 70.5 | 94.2 | 81.2 | 57.9 | 94.1 | 50.1 | 59.6 | 57.0 | 58.6 | 71.1
FRRN | yes | no | no | no | no | 2 | yes | Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes | Tobias Pohlen, Alexander Hermans, Markus Mathias, Bastian Leibe | arXiv | Full-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. | n/a | 71.8 | 98.2 | 83.3 | 91.6 | 45.8 | 51.1 | 62.2 | 69.4 | 72.4 | 92.6 | 70.0 | 94.9 | 81.6 | 62.7 | 94.6 | 49.1 | 67.1 | 55.3 | 53.5 | 69.5
MNet_MPRG | yes | no | no | no | no | no | no |  | Chubu University, MPRG |  | without val dataset, external dataset (e.g. image net) and post-processing | 0.6 | 71.9 | 98.1 | 82.9 | 91.8 | 43.6 | 50.5 | 64.3 | 71.4 | 74.6 | 92.7 | 70.3 | 94.7 | 82.4 | 60.9 | 94.1 | 50.9 | 62.5 | 57.2 | 53.8 | 70.0
ResNet-38 | yes | yes | no | no | no | no | yes | Wider or Deeper: Revisiting the ResNet Model for Visual Recognition | Zifeng Wu, Chunhua Shen, Anton van den Hengel | arXiv | Single model, no post-processing with CRFs. Model A2, 2 conv., fine+coarse, multi scale testing. | n/a | 80.6 | 98.7 | 86.9 | 93.3 | 60.4 | 62.9 | 67.6 | 75.0 | 78.7 | 93.7 | 73.7 | 95.5 | 86.8 | 71.1 | 96.1 | 75.2 | 87.6 | 81.9 | 69.8 | 76.7
FCN8s-QunjieYu | yes | no | no | no | no | no | no |  | Anonymous |  |  | n/a | 57.4 | 96.7 | 74.5 | 88.0 | 30.7 | 37.8 | 45.5 | 8.3 | 63.1 | 91.7 | 68.5 | 93.3 | 75.8 | 45.4 | 92.0 | 15.4 | 30.5 | 25.7 | 42.5 | 64.9
RGB-D FCN | yes | yes | no | yes | no | no | no |  | Anonymous |  | GoogLeNet + depth branch, single model. No data augmentation, no training on validation set, no graphical model. Used coarse labels to initialize depth branch. | n/a | 67.4 | 97.9 | 81.2 | 90.7 | 41.0 | 44.8 | 56.8 | 65.3 | 69.4 | 91.9 | 68.7 | 94.7 | 78.9 | 52.9 | 93.1 | 38.8 | 53.1 | 43.7 | 51.0 | 67.0
MultiBoost | yes | yes | no | yes | no | 2 | no |  | Anonymous |  | Boosting based solution. Publication is under review. | 0.25 | 59.3 | 95.9 | 69.5 | 87.3 | 34.4 | 32.7 | 40.5 | 54.9 | 58.6 | 89.2 | 65.3 | 90.3 | 68.4 | 42.5 | 89.0 | 22.5 | 51.9 | 40.9 | 36.5 | 55.7
GoogLeNet FCN | yes | no | no | no | no | no | no | Going Deeper with Convolutions | Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich | CVPR 2015 | GoogLeNet. No data augmentation, no graphical model. Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015 | n/a | 63.0 | 97.4 | 77.9 | 89.2 | 35.0 | 39.0 | 50.6 | 59.8 | 64.1 | 91.2 | 66.9 | 93.7 | 76.2 | 45.1 | 92.6 | 33.4 | 40.4 | 32.7 | 47.3 | 64.6
ERFNet (pretrained) | yes | no | no | no | no | 2 | yes | ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation | Eduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto Arroyo | Transactions on Intelligent Transportation Systems (T-ITS) | ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images | 0.02 | 69.7 | 97.9 | 82.1 | 90.7 | 45.2 | 50.4 | 59.0 | 62.6 | 68.4 | 91.9 | 69.4 | 94.2 | 78.5 | 59.8 | 93.4 | 52.3 | 60.8 | 53.7 | 49.9 | 64.2
ERFNet (from scratch) | yes | no | no | no | no | 2 | yes | Efficient ConvNet for Real-time Semantic Segmentation | Eduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto Arroyo | IV2017 | ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels | 0.02 | 68.0 | 97.7 | 81.0 | 89.8 | 42.5 | 48.0 | 56.2 | 59.8 | 65.3 | 91.4 | 68.2 | 94.2 | 76.8 | 57.1 | 92.8 | 50.8 | 60.1 | 51.8 | 47.3 | 61.6
TuSimple_Coarse | yes | yes | no | no | no | no | yes | Understanding Convolution for Semantic Segmentation | Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell |  | Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun. | n/a | 80.1 | 98.5 | 85.9 | 93.2 | 57.7 | 61.1 | 67.2 | 73.7 | 78.0 | 93.4 | 72.3 | 95.4 | 85.9 | 70.5 | 95.9 | 76.1 | 90.6 | 83.7 | 67.4 | 75.7
SAC-multiple | yes | no | no | no | no | no | no | Scale-adaptive Convolutions for Scene Parsing | Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan | International Conference on Computer Vision (ICCV) 2017 |  | n/a | 78.1 | 98.7 | 86.5 | 93.1 | 56.3 | 59.5 | 65.1 | 73.0 | 78.2 | 93.5 | 72.6 | 95.6 | 85.9 | 70.8 | 95.9 | 71.2 | 78.6 | 66.2 | 67.7 | 76.0
NetWarp | yes | yes | no | no | yes | no | no |  | Anonymous |  |  | n/a | 80.5 | 98.6 | 86.7 | 93.4 | 60.6 | 62.6 | 68.6 | 75.9 | 80.0 | 93.5 | 72.0 | 95.3 | 86.5 | 72.1 | 95.9 | 72.9 | 89.9 | 77.4 | 70.5 | 76.4
depthAwareSeg_RNN_ff | yes | no | no | no | no | no | yes |  | Anonymous |  | training with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test. | n/a | 78.2 | 98.5 | 85.4 | 92.5 | 54.4 | 60.9 | 60.2 | 72.3 | 76.8 | 93.1 | 71.6 | 94.8 | 85.2 | 69.0 | 95.7 | 70.1 | 86.5 | 75.5 | 68.3 | 75.5
Ladder DenseNet | yes | no | no | no | no | no | no |  | Anonymous |  | Anonymous ICCV submission 3205. | 0.45 | 74.3 | 97.4 | 80.2 | 92.0 | 47.6 | 53.9 | 64.6 | 72.8 | 76.3 | 92.8 | 66.4 | 95.5 | 83.8 | 66.1 | 94.3 | 55.6 | 70.3 | 67.0 | 62.1 | 73.0
Real-time FCN | yes | yes | no | no | no | no | no | Understanding Cityscapes: Efficient Urban Semantic Scene Understanding | Marius Cordts | Dissertation | Combines the following concepts: Network architecture: "Going deeper with convolutions", Szegedy et al., CVPR 2015. Framework and skip connections: "Fully convolutional networks for semantic segmentation", Long et al., CVPR 2015. Context modules: "Multi-scale context aggregation by dilated convolutions", Yu and Koltun, ICLR 2016. | 0.044 | 72.6 | 98.0 | 81.4 | 91.1 | 44.6 | 50.7 | 57.3 | 64.1 | 71.2 | 92.1 | 68.5 | 94.7 | 81.2 | 61.2 | 94.6 | 54.5 | 76.5 | 72.2 | 57.6 | 68.7
GridNet | yes | no | no | no | no | no | no |  | Anonymous |  | Conv-Deconv Grid-Network for semantic segmentation. Using only the training set without extra coarse annotated data (only 2975 images). No pre-training (ImageNet). No post-processing (like CRF). | n/a | 69.5 | 98.0 | 82.8 | 90.8 | 41.8 | 48.3 | 59.3 | 65.4 | 69.4 | 92.4 | 69.2 | 93.8 | 81.8 | 62.3 | 93.1 | 41.8 | 56.2 | 49.0 | 55.2 | 69.1
PEARL | yes | no | no | no | yes | no | no | Video Scene Parsing with Predictive Feature Learning | Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng Yan | ICCV 2017 | We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing. | n/a | 75.4 | 98.4 | 84.5 | 92.1 | 54.1 | 56.6 | 60.4 | 69.0 | 74.0 | 92.9 | 70.9 | 95.2 | 83.5 | 65.7 | 95.0 | 61.8 | 72.2 | 69.6 | 64.8 | 72.8
pruned & dilated inception-resnet-v2 (PD-IR2) | yes | yes | no | no | no | no | yes |  | Anonymous |  |  | 0.69 | 67.3 | 97.9 | 81.9 | 90.2 | 39.5 | 47.7 | 54.8 | 58.1 | 69.9 | 91.3 | 70.4 | 94.4 | 77.0 | 51.9 | 92.9 | 40.7 | 54.3 | 55.2 | 45.5 | 65.1
PSPNet | yes | yes | no | no | no | no | yes | Pyramid Scene Parsing Network | Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia | CVPR 2017 | This submission is trained on coarse+fine (train+val set, 2975+500 images). The former submission was trained on coarse+fine (train set, 2975 images) and achieves 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314. Previous versions of this method were listed as "SenseSeg_1026". | n/a | 81.2 | 98.7 | 86.9 | 93.5 | 58.4 | 63.7 | 67.7 | 76.1 | 80.5 | 93.6 | 72.2 | 95.3 | 86.8 | 71.9 | 96.2 | 77.7 | 91.5 | 83.6 | 70.8 | 77.5
motovis | yes | yes | no | no | no | no | no |  | motovis.com |  |  | n/a | 81.3 | 98.7 | 86.6 | 93.5 | 55.5 | 62.7 | 69.4 | 76.3 | 80.4 | 93.8 | 72.6 | 95.8 | 87.1 | 72.4 | 96.2 | 77.9 | 91.3 | 88.6 | 69.5 | 77.1
ML-CRNN | yes | no | no | no | no | no | no | Multi-level Contextual RNNs with Attention Model for Scene Labeling | Heng Fan, Xue Mei, Danil Prokhorov, Haibin Ling | arXiv | A framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them. | n/a | 71.2 | 97.9 | 81.0 | 91.0 | 50.3 | 52.4 | 56.7 | 65.7 | 71.4 | 92.2 | 69.6 | 94.6 | 80.2 | 59.3 | 93.9 | 51.1 | 67.6 | 54.5 | 55.1 | 68.6
Hybrid Model | yes | no | no | no | no | no | no |  | Anonymous |  |  | n/a | 65.8 | 97.5 | 78.5 | 89.0 | 39.0 | 46.1 | 48.6 | 58.7 | 64.0 | 91.2 | 68.3 | 91.8 | 76.8 | 51.9 | 92.2 | 40.0 | 50.6 | 44.9 | 54.3 | 66.6
tek-Ifly | yes | no | no | no | no | no | no | Iflytek | Iflytek-yin |  | using a fusion strategy of three single models, the best result of a single model is 80.01%, multi-scale | n/a | 81.1 | 98.6 | 86.3 | 93.5 | 61.2 | 64.1 | 66.0 | 75.6 | 79.1 | 93.7 | 72.8 | 95.6 | 86.3 | 69.9 | 96.0 | 76.8 | 90.7 | 86.8 | 71.0 | 77.1
GridNet | yes | no | no | no | no | no | yes | Residual Conv-Deconv Grid Network for Semantic Segmentation | Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian Wolf | BMVC 2017 | We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper). We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing. | n/a | 69.8 | 98.1 | 83.0 | 90.9 | 41.4 | 49.2 | 60.1 | 66.5 | 70.2 | 92.5 | 69.8 | 93.8 | 82.3 | 63.2 | 93.2 | 42.6 | 55.8 | 48.5 | 55.4 | 69.8
firenet | yes | no | no | no | no | 2 | no |  | Anonymous |  |  | n/a | 68.2 | 94.1 | 74.2 | 87.4 | 40.1 | 44.6 | 54.2 | 65.4 | 65.1 | 90.0 | 66.5 | 92.1 | 76.7 | 61.8 | 92.8 | 45.0 | 64.9 | 59.3 | 54.4 | 67.5
DeepLabv3 | yes | yes | no | no | no | no | no | Rethinking Atrous Convolution for Semantic Image Segmentation | Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam | arXiv preprint | In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment ASPP module with image-level features encoding global context and further boost performance. Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report. | n/a | 81.3 | 98.6 | 86.2 | 93.5 | 55.2 | 63.2 | 70.0 | 77.1 | 81.3 | 93.8 | 72.3 | 95.9 | 87.6 | 73.4 | 96.3 | 75.1 | 90.4 | 85.1 | 72.1 | 78.3
EdgeSenseSeg | yes | no | no | no | no | no | no |  | Anonymous |  | Deep segmentation network with hard negative mining and other tricks. | n/a | 76.8 | 98.4 | 84.8 | 92.5 | 52.0 | 58.1 | 61.5 | 73.0 | 76.1 | 93.3 | 71.8 | 95.0 | 85.2 | 68.5 | 95.4 | 62.4 | 77.6 | 70.7 | 66.8 | 75.5
iFLYTEK-CV | yes | yes | no | no | no | no | no | IFLYTEK RESEARCH | IFLYTEK CV Group - YinLin |  | Both fine (train&val) and coarse data were used to train a novel segmentation framework. | n/a | 81.4 | 98.6 | 86.3 | 93.4 | 61.6 | 64.2 | 67.2 | 75.5 | 79.2 | 93.6 | 73.0 | 95.6 | 86.4 | 70.3 | 96.0 | 77.3 | 90.9 | 88.3 | 71.1 | 77.0
ScaleNet | yes | yes | no | no | no | no | no | ScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving Scenes | Mohammad Dawud Ansari, Stephan Krarß, Oliver Wasenmüller and Didier Stricker | International Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018 | The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation. Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling field size accordingly. Our model is compact and can be extended easily to other research domains. Finally, the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate on the widely used automotive dataset Cityscapes as well as a self-recorded dataset. | n/a | 75.1 | 98.3 | 84.8 | 92.4 | 50.1 | 59.6 | 62.8 | 71.8 | 76.8 | 93.2 | 71.4 | 94.6 | 83.6 | 65.2 | 95.1 | 56.0 | 71.6 | 59.9 | 66.3 | 73.6
K-net | yes | no | no | no | no | no | no |  | XinLiang Zhong |  |  | n/a | 76.0 | 98.3 | 84.3 | 92.1 | 52.3 | 56.5 | 59.0 | 69.8 | 73.2 | 92.7 | 70.4 | 94.6 | 83.0 | 66.3 | 95.2 | 68.5 | 79.9 | 74.0 | 62.5 | 72.1
arsait | yes | no | no | no | no | no | no | anonymous | anonymous |  | anonymous | 0.4 | 62.3 | 97.9 | 81.1 | 88.0 | 38.7 | 37.3 | 35.5 | 48.8 | 58.5 | 89.4 | 68.7 | 93.1 | 68.5 | 44.1 | 91.5 | 42.9 | 51.5 | 48.1 | 41.2 | 58.3
MSNET | yes | no | no | no | no | no | no |  | Anonymous |  | previously also listed as "MultiPathJoin" and "MultiPath_Scale". | 0.2 | 76.8 | 98.3 | 85.0 | 92.5 | 48.6 | 56.7 | 67.5 | 75.5 | 78.4 | 93.3 | 72.8 | 94.9 | 85.8 | 71.1 | 95.3 | 59.8 | 73.3 | 65.9 | 68.5 | 76.4
Multitask Learning | yes | no | no | no | no | no | no | Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics | Alex Kendall, Yarin Gal and Roberto Cipolla |  | Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task. | n/a | 78.5 | 98.4 | 85.2 | 92.8 | 54.2 | 60.8 | 62.4 | 73.4 | 77.5 | 93.3 | 71.5 | 95.1 | 84.9 | 69.5 | 95.3 | 68.5 | 86.2 | 80.0 | 67.8 | 75.6
DeepMotion | yes | no | no | no | no | no | no |  | Anonymous |  | We propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation. | n/a | 81.4 | 98.7 | 87.0 | 93.5 | 61.6 | 62.6 | 65.4 | 74.6 | 78.6 | 93.6 | 72.5 | 95.4 | 86.2 | 72.3 | 96.1 | 82.3 | 92.8 | 85.7 | 70.2 | 76.6
SR-AIC | yes | yes | no | no | no | no | no |  | Anonymous |  |  | n/a | 81.9 | 98.7 | 87.2 | 93.7 | 62.6 | 64.7 | 69.0 | 76.4 | 80.8 | 93.7 | 73.3 | 95.5 | 86.8 | 72.2 | 96.2 | 77.9 | 90.6 | 87.9 | 71.2 | 77.3
Roadstar.ai_CV(SFNet) | yes | no | no | no | no | no | no | Roadstar.ai-CV | Maosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chen |  | same focus net (SFNet), based on only fine labels, with focus on the loss distribution and same focus on every layer of the feature map | 0.2 | 79.2 | 98.4 | 85.4 | 93.0 | 59.6 | 59.2 | 67.5 | 76.4 | 79.3 | 93.7 | 73.6 | 95.3 | 86.8 | 73.8 | 95.7 | 67.5 | 81.2 | 72.1 | 69.2 | 77.1
Mapillary Research: In-Place Activated BatchNorm | yes | yes | no | no | no | no | yes | In-Place Activated BatchNorm for Memory-Optimized Training of DNNs | Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder | arXiv | In-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model. | n/a | 82.0 | 98.4 | 85.0 | 93.6 | 61.7 | 63.9 | 67.7 | 77.4 | 80.8 | 93.7 | 71.9 | 95.6 | 86.7 | 72.8 | 95.7 | 79.9 | 93.1 | 89.7 | 72.6 | 78.2


iIoU on class-level

name | fine | coarse | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | average | person | rider | car | truck | bus | train | motorcycle | bicycle
FCN 8syesyesnononononononononononoyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
more details
0.541.755.933.483.922.230.826.731.149.6
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
more details
n/a48.560.638.088.426.139.643.138.853.6
Dilation10yesyesnononononononononononoyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
more details
4.042.056.334.585.821.832.727.628.049.1
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
more details
35.046.756.238.077.134.047.033.438.149.9
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22noyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
more details
4.034.940.723.178.621.432.427.620.834.6
DeepLab LargeFOV Strongyesyesnononononononono22noyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
more details
4.034.540.523.378.820.331.924.821.135.2
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
more details
n/a28.138.912.878.613.424.019.210.727.2
Segnet basicyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0632.044.322.778.416.124.320.715.833.6
Segnet extendedyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0634.249.927.181.115.323.718.519.638.4
CRFasRNNyesyesnononononononono22noyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
more details
0.734.450.617.881.118.025.030.322.330.1
Scale invariant CNN + CRFyesyesnonononoyesyesnononononoyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
more details
n/a44.959.040.084.019.735.833.036.051.4
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
more details
n/a39.153.628.985.020.128.324.924.846.9
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a41.660.633.486.719.525.625.830.550.5
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
more details
n/a51.761.541.286.335.847.742.042.157.4
NVSegNetyesyesnonononononononononononoAnonymousAt inference, we use the image at 2 different scales; the same is done during training.
more details
0.441.451.933.385.025.634.525.327.847.6
ENetyesyesnononononononono22noyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
more details
0.01334.447.620.880.017.526.821.820.939.4
DeepLabv2-CRFyesyesnononononononononononoyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
more details
n/a42.651.531.285.426.537.834.527.446.5
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
more details
1.043.655.833.686.624.836.531.930.548.8
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
more details
n/a38.344.529.881.319.831.330.428.740.7
LRR-4xyesyesnononononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
more details
n/a48.061.540.187.331.241.928.436.457.3
LRR-4xyesyesyesyesnononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
more details
n/a47.961.039.786.730.440.134.535.455.2
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
more details
n/a35.648.525.381.016.526.821.625.039.9
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
more details
0.0632.347.821.784.67.821.016.614.943.6
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
more details
4.051.863.441.288.933.448.244.438.756.4
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1. caffe-master
2. resnet-101
3. single scale testing

Previously listed as "LXFCRN".
more details
n/a52.466.746.688.532.841.240.942.959.9
RefineNetyesyesnononononononononononoyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
more details
n/a47.255.635.886.930.142.642.434.350.0
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
more details
0.856.163.347.089.841.456.248.445.456.7
TuSimpleyesyesnononononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
more details
n/a53.662.743.988.238.549.740.343.761.4
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
more details
n/a53.465.345.389.136.550.542.739.458.2
XPARSEyesyesnonononononononononononoAnonymous
more details
n/a49.261.740.587.531.342.735.938.355.4
ResNet-38yesyesnononononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
more details
n/a59.171.950.690.542.054.251.448.164.2
SegModelyesyesyesyesnonononononononononoAnonymous
more details
n/a56.465.246.790.141.555.148.345.359.2
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
more details
n/a47.060.536.187.926.142.132.335.156.0
FRRNyesyesnononononononono22noyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
more details
n/a45.562.939.087.922.035.628.135.153.0
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGTrained without the val dataset, external datasets (e.g. ImageNet), or post-processing
more details
0.646.666.940.689.222.532.930.133.357.2
ResNet-38yesyesyesyesnononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
more details
n/a57.868.548.890.542.051.950.747.062.8
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
more details
n/a34.553.228.184.99.418.213.322.046.6
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
more details
n/a42.156.035.585.822.333.523.330.050.6
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
more details
0.2532.541.225.877.811.223.624.921.833.4
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
more details
n/a38.654.028.685.016.929.619.325.749.8
ERFNet (pretrained)yesyesnononononononono22noyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


more details
0.0244.160.134.786.122.637.631.229.051.4
ERFNet (from scratch)yesyesnononononononono22noyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining or coarse labels
more details
0.0240.456.731.584.919.435.125.024.346.6
TuSimple_CoarseyesyesyesyesnononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
more details
n/a56.967.647.389.238.352.554.844.860.9
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
more details
n/a55.267.048.490.339.450.642.043.560.7
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
more details
n/a59.569.752.390.841.955.752.751.461.8
depthAwareSeg_RNN_ffyesyesnononononononononononoyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
more details
n/a56.066.346.788.437.350.752.045.560.8
Ladder DenseNetyesyesnonononononononononononoAnonymousAnonymous ICCV submission 3205.
more details
0.4551.668.842.990.129.042.540.738.260.4
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
more details
0.04445.559.236.185.225.640.035.632.749.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
more details
n/a44.157.336.485.621.437.829.332.052.7
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
more details
n/a51.663.242.588.033.245.842.640.856.8
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnononononononononoyesAnonymous
more details
0.6942.153.135.382.920.836.532.128.547.4
PSPNetyesyesyesyesnononononononononoyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
more details
n/a59.669.351.290.342.655.156.248.763.5
motovisyesyesyesyesnonononononononononomotovis.com
more details
n/a57.771.348.191.140.950.850.546.662.1
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
more details
n/a47.159.139.085.830.340.134.134.753.6
Hybrid ModelyesyesnonononononononononononoAnonymous
more details
n/a41.254.132.283.322.033.424.331.249.4
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinUsing a fusion strategy of three single models; the best result of a single model is 80.01%, multi-scale
more details
n/a60.169.948.390.144.655.457.750.264.5
GridNetyesyesnononononononononononoyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set (2975 images), without extra coarse annotated data, pre-training (ImageNet), or pre- or post-processing.
more details
n/a44.557.737.185.922.038.829.232.053.2
firenetyesyesnononononononono22nonoAnonymous
more details
n/a47.864.940.185.927.640.531.737.454.1
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment the ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
more details
n/a62.172.953.991.246.057.855.952.965.9
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
more details
n/a57.168.449.188.841.853.246.946.762.3
iFLYTEK-CVyesyesyesyesnonononononononononoIFLYTEK RESEARCHIFLYTEK CV Group - YinLinBoth fine(train&val) and coarse data were used to train a novel segmentation framework.
more details
n/a60.969.649.390.046.556.858.950.965.0
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krauß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation. Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling field size accordingly. Our model is compact and can be extended easily to other research domains. Finally, the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
more details
n/a53.165.644.788.634.647.242.443.258.6
K-netyesyesnonononononononononononoXinLiang Zhong
more details
n/a52.862.846.788.337.051.742.638.655.0
arsaityesyesnonononononononononononoanonymousanonymousanonymous
more details
0.433.643.621.481.018.025.125.117.736.9
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
more details
0.257.174.251.489.737.652.838.346.666.6
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a57.466.949.389.440.050.854.147.261.8
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel convnet-based method that extracts multi-scale features over a large range, particularly for street scene segmentation.
more details
n/a58.667.549.289.944.255.255.646.360.9
SR-AICyesyesyesyesnonononononononononoAnonymous
more details
n/a60.769.351.390.747.057.754.250.964.2
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi ChenSame Focus Net (SFNet), based on only fine labels, with focus on the loss distribution and the same focus on every layer of the feature map
more details
0.260.875.852.289.948.959.544.847.867.8
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnononononononononoyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a65.973.654.290.155.965.366.556.465.6
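
The iIoU scores in the table above differ from plain IoU in that true-positive and false-negative pixels are weighted by instance size, so that small, distant instances count as much as large ones. The following sketch illustrates the idea for a single class and is not the official evaluation script: the function name `iou_and_iiou` and the toy data are hypothetical, and the average instance size is taken from the toy ground truth itself, whereas the benchmark defines its own per-class average. Please refer to the Benchmark Suite for the authoritative definition.

```python
from collections import Counter


def iou_and_iiou(gt_cls, pred_cls, gt_inst, cls):
    """Return (IoU, iIoU) for one class over flat per-pixel lists.

    Sketch only: assumes `cls` occurs in the ground truth and that
    `gt_inst` holds a ground-truth instance id for every pixel.
    """
    # Pixel count of every ground-truth instance of the class.
    sizes = Counter(i for g, i in zip(gt_cls, gt_inst) if g == cls)
    avg_size = sum(sizes.values()) / len(sizes)

    tp = fn = fp = 0
    itp = ifn = 0.0
    for g, p, i in zip(gt_cls, pred_cls, gt_inst):
        if g == cls:
            weight = avg_size / sizes[i]  # small instances get larger weights
            if p == cls:
                tp += 1
                itp += weight
            else:
                fn += 1
                ifn += weight
        elif p == cls:
            fp += 1  # false positives are not tied to an instance: unweighted

    return tp / (tp + fp + fn), itp / (itp + fp + ifn)


# Toy scene: one large car (8 px, found), one small car (2 px, missed), 5 px road.
gt_cls = ["car"] * 8 + ["car"] * 2 + ["road"] * 5
gt_inst = [1] * 8 + [2] * 2 + [0] * 5
pred_cls = ["car"] * 8 + ["road"] * 2 + ["road"] * 5

iou, iiou = iou_and_iiou(gt_cls, pred_cls, gt_inst, "car")
print(f"IoU = {iou:.2f}, iIoU = {iiou:.2f}")
```

In this toy scene the missed small car costs only 2 of 10 pixels under plain IoU, but half of the weighted mass under iIoU, which is consistent with the iIoU values in the table being noticeably lower than the corresponding class-level IoU values.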

 
 

IoU on category-level

name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | average | flat | nature | object | sky | construction | human | vehicle
FCN 8syesyesnononononononononononoyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
more details
0.585.798.291.157.093.989.678.691.3
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
more details
n/a89.398.592.569.095.092.383.294.3
Dilation10yesyesnononononononononononoyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
more details
4.086.598.391.460.593.790.279.891.8
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
more details
35.082.897.889.748.292.288.773.189.6
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22noyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
more details
4.081.397.889.040.492.888.270.990.0
DeepLab LargeFOV Strongyesyesnononononononono22noyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
more details
4.081.297.889.040.492.788.071.089.7
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
more details
n/a79.597.388.037.793.986.665.487.7
Segnet basicyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0679.197.486.742.591.883.864.787.2
Segnet extendedyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0679.897.587.143.791.782.868.687.5
CRFasRNNyesyesnononononononono22noyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
more details
0.782.797.790.346.593.588.573.688.9
Scale invariant CNN + CRFyesyesnonononoyesyesnononononoyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
more details
n/a85.097.290.259.992.289.078.288.4
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
more details
n/a86.098.291.158.994.589.878.491.2
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a85.998.290.859.393.589.279.291.1
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
more details
n/a87.398.491.760.894.190.982.093.3
NVSegNetyesyesnonononononononononononoAnonymousAt inference, we use images at 2 different scales; the same holds for training.
more details
0.487.298.491.663.594.690.580.292.0
ENetyesyesnononononononono22noyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
more details
0.01380.497.388.346.890.685.465.588.9
DeepLabv2-CRFyesyesnononononononononononoyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
more details
n/a86.498.391.557.394.290.880.292.6
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
more details
1.087.698.491.963.294.591.580.393.2
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
more details
n/a83.797.389.448.293.688.877.191.1
LRR-4xyesyesnononononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
more details
n/a88.298.492.266.294.791.182.492.5
LRR-4xyesyesyesyesnononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
more details
n/a88.498.492.266.995.091.581.993.1
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
more details
n/a84.498.089.954.793.488.975.290.7
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
more details
0.0684.396.790.457.093.087.575.689.9
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
more details
4.089.698.692.969.195.292.883.694.6
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
more details
n/a89.398.592.469.894.591.984.393.4
RefineNetyesyesnononononononononononoyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
more details
n/a87.998.491.963.894.891.781.393.6
SegModelyesyesnonononononononononononoAnonymousBoth the train set (2975 images) and the val set (500 images) are used to train the model for this submission.
more details
0.889.898.693.068.195.593.385.295.0
TuSimpleyesyesnononononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
more details
n/a90.198.693.071.995.293.084.894.5
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
more details
n/a90.098.793.070.195.493.185.294.8
XPARSEyesyesnonononononononononononoAnonymous
more details
n/a88.798.592.466.595.291.982.893.7
ResNet-38yesyesnononononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
more details
n/a90.998.793.473.495.593.587.095.1
SegModelyesyesyesyesnonononononononononoAnonymous
more details
n/a90.498.793.271.295.593.585.395.2
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes the deeper stages focus on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of the deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
more details
n/a88.198.492.164.594.291.582.593.5
FRRNyesyesnononononononono22noyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
more details
n/a88.998.592.368.494.991.882.593.8
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGTrained without the val dataset, external datasets (e.g. ImageNet), or post-processing
more details
0.689.398.592.470.494.791.983.393.6
ResNet-38yesyesyesyesnononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
more details
n/a91.098.793.473.695.593.686.995.5
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
more details
n/a81.896.591.235.093.388.377.990.1
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
more details
n/a87.598.491.664.394.790.980.392.3
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
more details
0.2581.997.588.750.990.387.171.487.3
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
more details
n/a85.898.290.958.693.789.578.491.2
ERFNet (pretrained)yesyesnononononononono22noyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


more details
0.0287.398.291.565.194.290.678.992.3
ERFNet (from scratch)yesyesnononononononono22noyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining or coarse labels
more details
0.0286.598.291.162.494.290.177.491.9
TuSimple_CoarseyesyesyesyesnononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-the-art result of 80.1% mIoU on the test set. We are also state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
more details
n/a90.798.793.173.195.493.486.095.4
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
more details
n/a90.698.793.171.895.693.486.195.3
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
more details
n/a91.098.693.274.895.393.586.695.3
depthAwareSeg_RNN_ffyesyesnononononononononononoyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
more details
n/a89.798.692.868.394.893.085.595.0
Ladder DenseNetyesyesnonononononononononononoAnonymousAnonymous ICCV submission 3205.
more details
0.4589.798.392.171.195.592.384.593.9
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
more details
0.04487.998.491.564.394.791.481.693.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
more details
n/a87.998.492.165.593.890.981.892.5
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
more details
n/a89.298.592.667.695.292.383.794.2
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnononononononononoyesAnonymous
more details
0.6986.598.391.061.994.490.278.491.6
PSPNetyesyesyesyesnononononononononoyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
more details
n/a91.298.793.374.295.393.887.195.7
motovisyesyesyesyesnonononononononononomotovis.com
more details
n/a91.598.793.475.495.893.887.395.8
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
more details
n/a87.798.391.864.794.691.280.792.7
Hybrid ModelyesyesnonononononononononononoAnonymous
more details
n/a85.298.190.856.891.889.678.191.1
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinUsing a fusion strategy of three single models; the best result of a single model is 80.01%; multi-scale
more details
n/a90.998.793.472.995.693.786.695.6
GridNetyesyesnononononononononononoyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set (2975 images), without extra coarsely annotated data, pre-training (ImageNet), or any pre- or post-processing.
more details
n/a88.198.492.166.293.891.182.392.6
firenetyesyesnononononononono22nonoAnonymous
more details
n/a84.995.089.860.692.187.277.691.7
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
more details
n/a91.698.793.576.095.993.987.995.7
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
more details
n/a89.898.692.968.995.092.985.694.5
iFLYTEK-CVyesyesyesyesnonononononononononoIFLYTEK RESEARCHIFLYTEK CV Group - YinLinBoth fine(train&val) and coarse data were used to train a novel segmentation framework.
more details
n/a91.098.793.373.795.693.786.695.6
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krauß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation. Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling field size accordingly. Our model is compact and can be extended easily to other research domains. Finally, the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
more details
n/a89.698.592.869.994.692.984.194.4
K-netyesyesnonononononononononononoXinLiang Zhong
more details
n/a88.898.592.366.494.692.483.094.4
arsaityesyesnonononononononononononoanonymousanonymousanonymous
more details
0.482.097.989.246.293.187.969.890.0
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
more details
0.290.698.693.173.594.993.086.294.8
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a89.998.593.069.895.193.285.394.7
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
more details
n/a90.798.793.372.195.493.686.295.5
SR-AICyesyesyesyesnonononononononononoAnonymous
more details
n/a91.398.793.475.195.594.086.995.7
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi ChenSame Focus Net (SFNet), based on fine labels only, with a focus on the loss distribution and the same focus on every layer of the feature maps
more details
0.291.098.793.473.895.393.587.095.3
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnononononononononoyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a91.298.693.374.495.693.886.995.4

 
 

iIoU on category-level
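
For orientation, the iIoU scores listed under this heading weight each pixel by instance size, so that small instances count as much as large ones. The following is a minimal sketch, assuming the benchmark's definition iIoU = iTP / (iTP + FP + iFN), in which true-positive and false-negative pixels are weighted by the ratio of the class's average instance size to the size of their ground-truth instance. The function name `iiou`, the flat pixel lists, and the toy values are illustrative assumptions and not the official evaluation script; handling of ignored (void) pixels is omitted.

```python
from collections import Counter


def iiou(pred, gt, inst, cls, avg_size):
    """Instance-weighted IoU for one class over flat per-pixel lists.

    pred, gt : predicted / ground-truth label per pixel
    inst     : ground-truth instance id per pixel (only read where gt == cls)
    avg_size : average instance size of the class in pixels
               (assumed to be precomputed; hypothetical value below)
    """
    # Pixel count of every ground-truth instance of this class.
    sizes = Counter(i for g, i in zip(gt, inst) if g == cls)

    itp = ifn = 0.0  # weighted true positives / false negatives
    fp = 0           # false positives stay unweighted
    for p, g, i in zip(pred, gt, inst):
        if g == cls:
            weight = avg_size / sizes[i]
            if p == cls:
                itp += weight
            else:
                ifn += weight
        elif p == cls:
            fp += 1

    denom = itp + fp + ifn
    return itp / denom if denom else float("nan")


# Toy example: one 4-pixel instance (id 7) fully found,
# one 1-pixel instance (id 8) missed, one false-positive pixel.
gt = [1, 1, 1, 1, 1, 0, 0, 0]
inst = [7, 7, 7, 7, 8, 0, 0, 0]
pred = [1, 1, 1, 1, 0, 1, 0, 0]

score = iiou(pred, gt, inst, cls=1, avg_size=2.5)
print(round(score, 3))  # 0.417
```

On the same toy data the unweighted IoU would be 4 / 6, about 0.667; the missed one-pixel instance costs as much as the whole four-pixel instance gains, which is why iIoU values tend to be lower than the corresponding IoU values.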

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagehumanvehicle
FCN 8syesyesnononononononononononoyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
more details
0.570.158.082.3
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
more details
n/a74.061.886.1
Dilation10yesyesnononononononononononoyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
more details
4.071.158.383.9
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
more details
35.067.458.276.7
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22noyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
more details
4.058.741.475.9
DeepLab LargeFOV Strongyesyesnononononononono22noyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
more details
4.058.741.376.1
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
more details
n/a57.939.976.0
Segnet basicyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0661.947.076.8
Segnet extendedyesyesnononononononono44noyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0666.451.980.9
CRFasRNNyesyesnononononononono22noyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
more details
0.766.053.478.6
Scale invariant CNN + CRFyesyesnonononoyesyesnononononoyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
more details
n/a71.260.681.7
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
more details
n/a69.155.083.1
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a73.962.685.2
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
more details
n/a74.163.185.1
NVSegNetyesyesnonononononononononononoAnonymousAt inference, we use images at 2 different scales; the same holds for training.
more details
0.468.153.582.7
ENetyesyesnononononononono22noyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
more details
0.01364.049.378.7
DeepLabv2-CRFyesyesnononononononononononoyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
more details
n/a67.752.582.9
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
more details
1.070.657.084.1
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
more details
n/a62.445.879.0
LRR-4xyesyesnononononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
more details
n/a74.763.386.2
LRR-4xyesyesyesyesnononononononononoyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
more details
n/a73.962.785.0
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
more details
n/a64.350.078.5
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
more details
0.0666.050.082.0
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
more details
4.075.564.386.7
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
more details
n/a77.668.386.9
RefineNetyesyesnononononononononononoyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
more details
n/a70.656.884.5
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
more details
0.875.964.287.6
TuSimpleyesyesnononononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
more details
n/a75.264.086.5
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently appeared and similarly named approach, the submission name was updated.
more details
n/a76.866.786.9
XPARSEyesyesnonononononononononononoAnonymous
more details
n/a74.263.085.4
ResNet-38yesyesnononononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
more details
n/a81.173.289.0
SegModelyesyesyesyesnonononononononononoAnonymous
more details
n/a77.066.287.9
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
more details
n/a74.162.086.2
FRRNyesyesnononononononono22noyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
more details
n/a75.164.985.4
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGwithout val dataset, external dataset (e.g. image net) and post-processing
more details
0.677.968.687.1
ResNet-38yesyesyesyesnononononononononoyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
more details
n/a79.169.688.5
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
more details
n/a68.755.681.7
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
more details
n/a71.058.083.9
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
more details
0.2560.245.075.5
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy , Wei Liu , Yangqing Jia , Pierre Sermanet , Scott Reed , Dragomir Anguelov , Dumitru Erhan , Vincent Vanhoucke , Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
more details
n/a69.856.383.3
ERFNet (pretrained)yesyesnononononononono22noyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


more details
0.0272.761.284.1
ERFNet (from scratch)yesyesnononononononono22noyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels
more details
0.0270.458.082.8
TuSimple_CoarseyesyesyesyesnononononononononoyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the
PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
more details
n/a77.868.687.1
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
more details
n/a78.368.488.2
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
more details
n/a79.870.788.9
depthAwareSeg_RNN_ffyesyesnononononononononononoyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
more details
n/a76.967.486.5
Ladder DenseNetyesyesnonononononononononononoAnonymousAnonymous ICCV submission 3205.
more details
0.4579.570.488.6
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
more details
0.04471.660.582.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
more details
n/a71.158.384.0
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
more details
n/a75.164.385.9
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnononononononononoyesAnonymous
more details
0.6968.355.381.2
PSPNetyesyesyesyesnononononononononoyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
more details
n/a79.270.288.2
motovisyesyesyesyesnonononononononononomotovis.com
more details
n/a80.772.389.0
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
more details
n/a72.560.984.1
Hybrid ModelyesyesnonononononononononononoAnonymous
more details
n/a68.555.681.5
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinusing a fusion strategy of three single models, the best result of a single model is 80.01%,multi-scale
more details
n/a79.670.788.4
GridNetyesyesnononononononononononoyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing.
more details
n/a71.458.784.2
firenetyesyesnononononononono22nonoAnonymous
more details
n/a75.566.484.5
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects
at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment the ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
more details
n/a81.774.089.4
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
more details
n/a78.569.787.3
iFLYTEK-CVyesyesyesyesnonononononononononoIFLYTEK RESEARCHIFLYTEK CV Group - YinLinBoth fine(train&val) and coarse data were used to train a novel segmentation framework.
more details
n/a79.570.588.5
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krarß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation.
Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a
scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling
field size accordingly. Our model is compact and can be extended easily to other research domains. Finally,
the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate
on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
more details
n/a76.866.986.7
K-netyesyesnonononononononononononoXinLiang Zhong
more details
n/a75.464.686.3
arsaityesyesnonononononononononononoanonymousanonymousanonymous
more details
0.461.845.478.2
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
more details
0.281.675.088.3
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a77.768.087.4
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
more details
n/a78.168.587.8
SR-AICyesyesyesyesnonononononononononoAnonymous
more details
n/a79.670.289.0
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chensame focus net (SFNet), based on only fine labels, with focus on the loss distribution and the same focus on every layer of the feature map
more details
0.282.676.488.7
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnononononononononoyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a81.774.489.0

 
 

Instance-Level Semantic Labeling Task

 

AP on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
more details
60.04.61.30.610.56.19.75.91.70.5
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a8.912.511.722.53.35.93.26.95.1
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
more details
0.22.30.00.018.20.00.00.00.00.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
more details
n/a17.414.612.935.716.023.219.010.37.8
RecAttendyesyesnononononononono44nonoAnonymous
more details
n/a9.59.23.127.58.012.17.94.83.3
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
more details
n/a9.86.59.323.16.710.910.36.84.6
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
more details
n/a13.010.08.023.714.019.515.29.34.7
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label. Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).
more details
n/a20.016.516.725.720.630.023.417.110.1
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22noyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
more details
n/a17.513.516.224.416.823.919.215.210.7
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
more details
n/a25.021.820.139.424.833.230.817.712.4
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
more details
n/a32.034.827.049.130.140.930.924.118.7
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
more details
n/a26.230.523.746.922.832.218.619.116.0
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
more details
n/a19.415.514.131.522.527.022.913.98.0
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
more details
n/a12.513.411.424.59.414.512.28.06.7
SegNetyesyesyesyesnonononononononononoAnonymous
more details
0.529.529.923.443.429.841.033.318.716.7
LCISyesyesnonononononononononononoAnonymous
more details
n/a15.115.114.823.712.916.815.412.49.3
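
The "average" column in the AP table above can be sanity-checked against the per-class columns. The following is a minimal sketch, assuming the average is the unweighted mean over the eight classes (this page does not state the definition; see the Benchmark Suite for the authoritative one). The values are copied from three rows of the table; the helper names `class_mean` and `check_rows` are hypothetical. Because the displayed values are rounded to one decimal, the recomputed mean may differ from the reported average by up to about 0.1.

```python
# Values copied from the AP table above: (reported average, per-class APs).
# Class order: person, rider, car, truck, bus, train, motorcycle, bicycle.
rows = {
    "Mask R-CNN [COCO]": (32.0, [34.8, 27.0, 49.1, 30.1, 40.9, 30.9, 24.1, 18.7]),
    "SGN": (25.0, [21.8, 20.1, 39.4, 24.8, 33.2, 30.8, 17.7, 12.4]),
    "R-CNN + MCG convex hull": (4.6, [1.3, 0.6, 10.5, 6.1, 9.7, 5.9, 1.7, 0.5]),
}


def class_mean(values):
    """Unweighted mean over classes (ASSUMED definition of the 'average' column)."""
    return sum(values) / len(values)


def check_rows(table, tol=0.1):
    """Return {name: (reported, recomputed, within_tolerance)} for each row.

    tol allows for the rounding of both the reported average and the
    per-class values to one decimal place.
    """
    result = {}
    for name, (reported, per_class) in table.items():
        recomputed = class_mean(per_class)
        result[name] = (reported, recomputed, abs(recomputed - reported) <= tol)
    return result


if __name__ == "__main__":
    for name, (reported, recomputed, ok) in check_rows(rows).items():
        print(f"{name}: reported={reported}, recomputed={recomputed:.3f}, consistent={ok}")
```

The same check applies to the other instance-level tables, since they share the same class columns.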

 
 

AP 50 % on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
more details
60.012.95.63.926.013.826.315.88.63.1
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a21.131.833.837.87.612.08.520.517.2
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
more details
0.23.70.00.029.20.00.00.00.00.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
more details
n/a36.734.040.454.727.240.138.932.226.0
RecAttendyesyesnononononononono44nonoAnonymous
more details
n/a18.921.212.741.913.920.715.514.710.5
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
more details
n/a23.218.429.538.316.121.524.521.416.0
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
more details
n/a27.928.026.844.822.230.430.125.115.7
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label. Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).
more details
n/a38.837.142.145.730.244.740.541.828.4
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22noyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
more details
n/a35.932.040.743.228.539.135.737.929.8
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
more details
n/a44.945.247.759.736.345.453.739.531.8
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
more details
n/a58.167.165.471.842.361.053.954.349.0
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
more details
n/a49.960.759.568.333.148.238.946.543.9
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
more details
n/a35.334.036.948.531.340.136.232.922.9
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
more details
n/a25.231.529.740.016.023.821.719.219.9
SegNetyesyesyesyesnonononononononononoAnonymous
more details
0.555.657.956.567.546.265.163.043.944.8
LCISyesyesnonononononononononononoAnonymous
n/a30.833.337.442.521.927.627.932.023.9

 
 

AP 100 m on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
60.07.72.61.117.510.617.49.22.60.9
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a15.324.420.336.45.510.65.210.59.2
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.23.90.00.031.00.00.00.00.00.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation".
n/a29.330.322.758.224.938.629.915.314.3
RecAttendyesyesnononononononono44nonoAnonymous
n/a16.819.65.546.814.221.513.17.26.1
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a16.813.516.638.411.319.216.910.48.3
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a22.119.714.038.924.834.423.113.78.0
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label. Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).
n/a32.631.828.241.133.748.935.324.017.7
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22noyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a27.825.127.540.024.439.426.522.217.9
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a38.936.732.760.139.953.744.124.420.0
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a45.851.339.367.942.858.846.831.427.9
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a37.646.235.665.531.146.027.524.924.3
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a31.427.723.150.837.946.433.719.412.7
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a20.424.519.639.314.524.218.511.111.1
SegNetyesyesyesyesnonononononononononoAnonymous
0.543.248.135.161.443.261.744.626.125.4
LCISyesyesnonononononononononononoAnonymous
n/a24.228.624.536.821.127.021.617.616.2

 
 

AP 50 m on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
60.010.32.71.121.214.025.214.22.71.0
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a16.725.021.040.76.713.56.411.29.3
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.24.90.00.039.00.00.00.00.00.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation".
n/a34.031.523.463.132.250.540.416.514.6
RecAttendyesyesnononononononono44nonoAnonymous
n/a20.920.75.854.217.932.121.97.86.4
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a20.314.017.443.915.026.126.211.68.5
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a26.120.114.642.532.344.731.714.38.2
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label. Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).
n/a37.632.829.044.239.960.751.724.917.8
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22noyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a31.025.128.244.028.647.732.523.518.0
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a44.536.833.363.250.767.459.225.320.0
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a49.551.540.169.949.369.255.931.927.9
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a40.146.235.967.437.851.232.925.324.3
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a36.827.423.753.547.164.345.120.213.1
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a22.124.720.242.517.227.621.811.711.3
SegNetyesyesyesyesnonononononononononoAnonymous
0.545.848.035.262.550.870.048.126.325.3
LCISyesyesnonononononononononononoAnonymous
n/a25.829.325.338.127.027.724.318.116.3