A Call for Rigorous Validation in 3D Medical Image Segmentation

1 Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany
2 Interactive Machine Learning Group (IML), DKFZ, Heidelberg, Germany
3 Helmholtz Imaging, DKFZ, Heidelberg, Germany
4 Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
5 National Center for Tumor Diseases (NCT) Heidelberg, Germany
6 Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, Germany
7 Faculty of Mathematics and Computer Science, University of Heidelberg, Heidelberg, Germany

Correspondence: f.isensee@dkfz-heidelberg.de

Fabian Isensee*1,3, Tassilo Wald*1,3,7, Constantin Ulrich*1,5,6, Michael Baumgartner*1,3,7, Saikat Roy1, Klaus Maier-Hein†1,3,4,5,6,7, Paul Jäger†2,3

Abstract

The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.

Keywords:

Medical Image Segmentation, Validation, Benchmark

* Equal contribution. Authors are permitted to list their name first in their CVs.
† Equal supervision.

1 Introduction

Medical image segmentation remains a highly active area of research, evidenced by the U-Net architecture receiving over 20,000 citations in 2023 alone [29]. The introduction of nnU-Net in 2018 was a pivotal moment, highlighting that careful implementation and configuration of the architecture are more crucial for achieving state-of-the-art results than modifying the architecture itself [23, 21]. Despite this, the appeal of novel architectures from the broader computer vision domain, such as Transformers [33] and Mamba [13], persists. Adaptations of these cutting-edge designs to the medical imaging domain have emerged, with claims of superior performance over the conventional CNN-based U-Net [32, 11, 15, 17, 41, 38, 26, 9, 12, 37, 34].

In this paper, we critically examine these claims and find that the current rapid adoption of new methods to the medical domain comes with a lack of stringent validation. As a consequence, we observe that many recent claims of methodological superiority do not hold when systematically tested in a comprehensive benchmark. This trend raises significant concerns, indicating a prevailing attention bias in medical image segmentation towards novel architectures. To overcome this bias and redirect the field towards meaningful methodological progress, we call for a systemic change emphasizing rigorous validation practices. Our study makes the following contributions:

  1. 1.

    We systematically identify validation pitfalls in the field and provide recommendations for how to avoid them.

  2. 2.

    We conduct a large-scale benchmark under a thorough validation protocol to scrutinize the performance of prevalent segmentation methods.

  3. 3.

    Based on this analysis, we identify key methodological components for medical image segmentation as well as a set of suitable benchmarking datasets.

  4. 4.

    We release a series of updated standardized baselines for 3D medical segmentation at https://github.com/MIC-DKFZ/nnUNet. These are based on a residual encoder U-Net within the nnU-Net framework and tailored to accommodate a spectrum of hardware capabilities ("M", "L", "XL").
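For readers who want to try the released presets, the following minimal Python sketch illustrates how they can be invoked at the time of writing. The command, planner, and plans names (nnUNetv2_plan_and_preprocess, nnUNetPlannerResEncL, nnUNetResEncUNetLPlans) follow the repository documentation but should be treated as assumptions that may change; the dataset ID is a placeholder.

```python
# Hypothetical usage sketch for the released ResEnc presets ("M", "L", "XL").
# Exact planner/plans identifiers are assumptions based on the repository docs.
import subprocess

dataset_id = "220"  # placeholder nnU-Net dataset ID
fold = "0"

# Plan and preprocess with the ResEnc "L" planner (assumed name: nnUNetPlannerResEncL).
subprocess.run(
    ["nnUNetv2_plan_and_preprocess", "-d", dataset_id, "-pl", "nnUNetPlannerResEncL"],
    check=True,
)

# Train the 3d_fullres configuration using the plans produced by that planner
# (assumed plans identifier: nnUNetResEncUNetLPlans).
subprocess.run(
    ["nnUNetv2_train", dataset_id, "3d_fullres", fold, "-p", "nnUNetResEncUNetLPlans"],
    check=True,
)
```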

2 Validation Pitfalls

In the following, we present a collection of predominant validation pitfalls in current practice paired with recommendations on how to avoid them. In Section 4, we underscore the critical need for this initiative by empirically demonstrating how these pitfalls lead to unsupported claims of methodological superiority.

2.1 Baseline-related Pitfalls

Providing a fair and comprehensive comparison to existing work is essential for scientific progress. Currently, we observe a lack of rigour in ensuring meaningful comparison.

P1: Coupling the claimed innovation with confounding performance boosters: There are multiple ways to artificially boost a method’s performance, obfuscating the real impact of the claimed innovation. One example is coupling the claimed innovation with residual connections in the encoder while the baseline uses a vanilla CNN encoder [26]. Another example is coupling the claimed innovation with additional training data not used in baselines [16]. This is even more critical if the use of additional data is not made transparent [3]. A related pitfall is to couple the claimed innovation with self-supervised pretraining, while the baselines train from scratch [32]. A third example is coupling the claimed innovation with larger hardware capabilities, i.e. comparing against baselines that are not scaled to the same compute budget (VRAM usage and training time) [31]. Finally, claimed innovations are sometimes based solely on leaderboard results where the method is coupled with 20-fold ensembling, while other leaderboard entries do not use such costly performance boosters [32, 16]. Recommendation (R1): Meaningful validation entirely isolates the effect of the claimed innovation by ensuring a fair comparison to baselines where the proposed method is not coupled with confounding performance boosters.

P2: Lack of well-configured and standardized baselines: nnU-Net has demonstrated that proper method configuration often impacts performance more significantly than the architecture itself [21]. This suggests that claims of methodological superiority may be misleading if based on comparisons against an ill-configured baseline (i.e. a manually configured U-Net [29] with non-transparent and potentially subpar hyperparameter optimization). Some methods, like nnU-Net, address the "faulty baseline" problem by offering automatic, high-quality, and thus standardized, configuration on new datasets. Despite this, many studies continue to claim methodological superiority without benchmarking against any such standardized baseline with a proven high-quality configuration [38, 37, 34, 12, 9, 11, 40]. Beyond auto-configuration frameworks like nnU-Net, it is almost impossible to ensure a high-quality configuration when including existing methods as baselines, because typically no instructions for adaptation to new tasks are provided. This need for manual adjustments, even if an equal hyperparameter tuning budget is allocated to all methods, is an error-prone process that ultimately diminishes the relevance of results. Recommendation (R2): Beyond the call for ensuring high-quality configuration of baselines, long-term standardization in the field can only be achieved if newly proposed methods are equipped with adaptation instructions, or ideally, are carefully integrated within auto-configuration frameworks to inherit their capabilities.

2.2 Dataset-related Pitfalls

P3: Insufficient quantity and suitability of datasets: The nnU-Net study contains experiments demonstrating 1) the vast diversity of biomedical datasets and 2) the corresponding need to test on a sufficient number and variety of datasets when making claims about general methodological advancements [21]. However, the median number of datasets employed in recent studies claiming superior segmentation performance is three [32, 16, 18, 28, 17, 26, 38, 20, 31, 41, 37, 12, 9, 11]. Though the number might seem unremarkable on its own, it becomes concerning when considering the varying benchmarking suitability of popular datasets. For instance, as we empirically analyse in Section 4, neither of the two datasets BTCV [25] and BraTS [5, 27, 6], while being useful environments for solving their respective clinical tasks, provides a reliable foundation for assessing general methodological advancements. This is due to a high statistical variance (BTCV) and a low systematic variance (BraTS). Despite this, numerous studies claim methodological superiority while at least 50% of the benchmark is made up of either BTCV [37, 11, 9] or BraTS [14, 34, 31, 28, 16]. Recommendation (R3): Meaningful validation requires that utilized datasets are a suitable basis for measuring the claimed methodological advancement. This includes sufficient dataset quantity and diversity, as well as benchmarking suitability of individual datasets, as assessed in our study in Section 4.

P4: Inconsistent reporting practices: Standardization of public leaderboard submissions is limited, e.g. allowing varying strategies for ensembling, test time augmentation, and post-processing. While perfectly serving the need to demonstrate that a proposed method can push the state of the art when equipped with all bells and whistles, such non-standardized settings undermine the ability to draw meaningful methodological conclusions. Consequently, researchers often resort to custom train/test splits for controlled comparisons against baselines, but these typically involve small test sets that introduce substantial result instability and question the significance of minor performance gains [36, 39, 41]. Moreover, selectively reporting results for only specific classes of a dataset without justification further compromises result integrity [41, 11, 9, 36, 39]. Recommendation (R4): 5-fold cross-validation with a rotating validation set improves reliability and often represents a pragmatic solution. However, using the same dataset(s) for development and validation bears the risk of implicit overfitting and a lack of generalizability. Thus, ideally, differentiating between a pool of development datasets and an independent pool of test datasets for cross-validation against baselines would offer a more reliable assessment of method performance.
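To make R4 concrete, the following is a minimal sketch, assuming a generic Python setup with scikit-learn, of how fixed 5-fold splits can be generated once and reused identically by every compared method. The case IDs and the evaluate() stub are hypothetical placeholders, not part of any specific framework.

```python
# Minimal sketch of R4: generate fixed 5-fold splits once and reuse them for
# every compared method, so result differences cannot stem from differing splits.
import numpy as np
from sklearn.model_selection import KFold

case_ids = np.array([f"case_{i:03d}" for i in range(200)])  # placeholder case identifiers
kfold = KFold(n_splits=5, shuffle=True, random_state=12345)  # fixed seed -> identical splits

splits = [
    {"train": case_ids[train_idx].tolist(), "val": case_ids[val_idx].tolist()}
    for train_idx, val_idx in kfold.split(case_ids)
]

def evaluate(method_name: str, split: dict) -> float:
    """Placeholder: train `method_name` on split['train'] and return mean DSC on split['val']."""
    raise NotImplementedError

# Every method sees exactly the same five folds; report the mean over folds, e.g.:
# scores = {m: np.mean([evaluate(m, s) for s in splits]) for m in ["nnU-Net", "MethodX"]}
```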

3 Systematic 3D Medical Segmentation Benchmark

Taking these pitfalls and recommendations into account, we revisit recently proposed methods on the basis of a systematic and comprehensive benchmark.

3.1 Compared Methods

We categorize methods into CNN-based, Transformer-based, and Mamba-based.

CNN-based: We include nnU-Net’s original configuration using a vanilla U-Net as well as a variant employing a U-Net with residual connections in the encoder ("nnU-Net ResEnc"), which has been part of the official repository since 2019 [22]. In the spirit of avoiding benchmarking of unequal hardware settings in the future (see P1), we introduce new nnU-Net ResEnc presets, which use nnU-Net’s existing automatic adaptation of batch and patch sizes to target varying VRAM budgets ("M", "L", "XL"). We further include MedNeXt, a transformer-inspired CNN modification using ConvNeXt blocks (we test size "L" with kernel size "k3" and upkernel "k5") [31], and STU-Net, a series of scaled-up U-Nets with increasing parameter counts named "S"(mall), "B"(ase), "L"(arge), and "H"(uge) [20].

Transformer-based: We test SwinUNETR in its original version [32] as well as version 2 [17], nnFormer [41], and CoTr, a hybrid architecture combining convolutional and transformer modules [37].

Mamba-based: We test the recently proposed U-Mamba model [26], employing Mamba layers either in the U-Net encoder ("U-Mamba Enc") or exclusively in the bottleneck ("U-Mamba Bot"). We also include an ablation missing in the original publication using the identical setting while switching off the Mamba layers ("No-Mamba Base"). All aforementioned methods were originally implemented in the nnU-Net framework except SwinUNETR (V1+V2), which we integrate into the nnU-Net framework due to incomplete configuration instructions (P2).

Framework comparison: In addition to comparing recent methods, we also benchmark nnU-Net against a recent alternative framework: Auto3DSeg (version 1.3.0) [1, 28, 18, 32] is part of the MONAI ecosystem [10] and recently created a buzz at MICCAI 2023 by winning several highly competitive challenges such as KiTS2023, thereby positioning itself as an alternative to nnU-Net promising the same auto-configuration functionality [1]. The framework is tested by means of three featured architectures ("SegResNet" [28], "DiNTS" [18], "SwinUNETR" [32]).

In the spirit of R1 and R2, we employ a standardized scheme for hyperparameter configuration by either 1) using the self-configuration abilities of methods if available, 2) selecting the configuration closest to the respective dataset if multiple configurations were provided, 3) using the default configuration in case no alternatives were provided, or 4), where necessary, decreasing the learning rate until convergence was achieved. All models are trained from scratch. The only exception is SwinUNETR in the Auto3DSeg framework: altering its default of automatically loading pre-trained weights would have contradicted our hyperparameter configuration scheme. We also employed an equal maximum VRAM budget across all methods by running all trainings on a single NVIDIA A100 with 40GB VRAM. This budget excludes the largest STU-Net variant ("H") from our benchmark.

3.2 Utilized Datasets

Our benchmark utilizes six datasets: BTCV [25], ACDC [7], LiTS [8, 4], BraTS2021 [5, 27, 6], KiTS2023 [19], and AMOS2022 (post-challenge Task 2) [24]. We selected datasets based on popularity, allowing us to follow R3 and assess the prevalent datasets w.r.t. their suitability for method benchmarking. Given that an effective benchmarking dataset should enable measuring consistent signals of methodological differences, we derive two requirements for suitability: 1) a low standard deviation (SD) of DSC scores from the same method across the five folds (intra-method SD), indicating statistical stability and a high signal-to-noise ratio, and 2) a high SD across different methods (inter-method SD), indicating meaningful signals of methodological differences, i.e. performance does not saturate too fast on the respective task. Our final suitability score is the ratio of inter-method versus intra-method SD.
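The sketch below shows one possible implementation of this ratio, assuming a matrix of per-fold mean DSC values (methods × folds) for a single dataset; the exact aggregation used in our analysis may differ in detail, and the numbers shown are toy values.

```python
# One possible implementation of the suitability score: the ratio of
# inter-method SD to intra-method SD of per-fold mean DSC values (higher = more suitable).
import numpy as np

def suitability_score(dsc: np.ndarray) -> float:
    """dsc has shape (n_methods, n_folds); returns inter-method SD / intra-method SD."""
    intra_sd = dsc.std(axis=1, ddof=1).mean()   # SD across folds per method, averaged over methods
    inter_sd = dsc.mean(axis=1).std(ddof=1)     # SD of fold-averaged scores across methods
    return inter_sd / intra_sd

# Toy example: 3 methods x 5 folds (illustrative numbers only)
dsc = np.array([
    [86.0, 86.5, 85.8, 86.2, 86.1],
    [88.1, 88.4, 87.9, 88.3, 88.0],
    [81.2, 81.5, 80.9, 81.3, 81.1],
])
print(f"suitability ratio: {suitability_score(dsc):.2f}")
```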

Following R4, we report results using 5-fold cross-validation, employing splits generated by nnU-Net and applying these consistently across all methods. Since we do not develop new methods in this study, we refrain from distinguishing between development and test dataset pools. We report the average Dice Similarity Coefficient (DSC) as our primary metric and the Normalized Surface Distance (NSD) as our secondary metric. For both metrics, results are averaged over all classes of each dataset as well as over the five folds to assess generalist segmentation capabilities without delving into problem-specific metric nuances. For datasets featuring hierarchical evaluation regions (BraTS2021, KiTS2023), we calculate metrics for these regions rather than the non-overlapping classes.
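The following sketch illustrates this aggregation, i.e. DSC computed per evaluation region and averaged, where hierarchical regions are formed as unions of labels. The label values and region names are illustrative placeholders and not the actual BraTS/KiTS label schemes.

```python
# Illustrative sketch: per-region Dice (DSC) averaged over evaluation regions,
# with hierarchical regions defined as unions of label values.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Binary Dice Similarity Coefficient; returns 1.0 if both masks are empty."""
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def mean_dsc(pred_seg: np.ndarray, gt_seg: np.ndarray, region_map: dict) -> float:
    """Average DSC over evaluation regions, each defined as a set of label values."""
    scores = []
    for labels in region_map.values():
        pred_mask = np.isin(pred_seg, labels)
        gt_mask = np.isin(gt_seg, labels)
        scores.append(dice(pred_mask, gt_mask))
    return float(np.mean(scores))

# Example: hierarchical regions as unions of (hypothetical) label values.
region_map = {"whole_lesion": [1, 2, 3], "core": [2, 3], "enhancing": [3]}
pred = np.random.randint(0, 4, size=(8, 8, 8))
gt = np.random.randint(0, 4, size=(8, 8, 8))
print(f"mean DSC over regions: {mean_dsc(pred, gt, region_map):.3f}")
```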

4 Results and Discussion

[Fig. 1: Benchmarking suitability of the six datasets, assessed via intra-method vs. inter-method SD of DSC scores; detailed values in Appendix Table 3.]

KiTS, AMOS, and ACDC are the most suitable datasets for benchmarking 3D segmentation methods. Fig. 1 shows the outcome of the dataset analysis based on our benchmark (for detailed results see Appendix Table 3). We find that KiTS, AMOS, and ACDC exhibit low statistical noise (intra-method SD) while effectively differentiating between methods, as indicated by a high inter-method SD. Of the three, KiTS features by far the highest inter-method SD, indicating the lowest performance saturation on the task. Conversely, scores on BraTS21 are saturated, with minimal variation both between and within methods. BTCV exhibits an SD ratio below one, indicating that statistical noise may exceed the signal of performance differences between methods. LiTS represents a middle ground in terms of benchmarking suitability. In summary, ACDC, AMOS, and KiTS can be recommended as the most suitable datasets for benchmarking, while BraTS, LiTS, and BTCV are less suitable for this purpose.

Table 1: Benchmark results. Mean DSC [%] per dataset (5-fold cross-validation), training VRAM [GB], runtime (RT) [h], architecture type, and whether the method runs within the nnU-Net framework (nnU).

| Method | BTCV (n=30) | ACDC (n=200) | LiTS (n=131) | BraTS (n=1251) | KiTS (n=489) | AMOS (n=360) | VRAM [GB] | RT [h] | Arch. | nnU |
|---|---|---|---|---|---|---|---|---|---|---|
| nnU-Net (org.) [21] | 83.08 | 91.54 | 80.09 | 91.24 | 86.04 | 88.64 | 7.70 | 9 | CNN | Yes |
| nnU-Net ResEnc M | 83.31 | 91.99 | 80.75 | 91.26 | 86.79 | 88.77 | 9.10 | 12 | CNN | Yes |
| nnU-Net ResEnc L | 83.35 | 91.69 | 81.60 | 91.13 | 88.17 | 89.41 | 22.70 | 35 | CNN | Yes |
| nnU-Net ResEnc XL | 83.28 | 91.48 | 81.19 | 91.18 | 88.67 | 89.68 | 36.60 | 66 | CNN | Yes |
| MedNeXt L k3 [31] | 84.70 | 92.65 | 82.14 | 91.35 | 88.25 | 89.62 | 17.30 | 68 | CNN | Yes |
| MedNeXt L k5 [31] | 85.04 | 92.62 | 82.34 | 91.50 | 87.74 | 89.73 | 18.00 | 233 | CNN | Yes |
| STU-Net S [20] | 82.92 | 91.04 | 78.50 | 90.55 | 84.93 | 88.08 | 5.20 | 10 | CNN | Yes |
| STU-Net B [20] | 83.05 | 91.30 | 79.19 | 90.85 | 86.32 | 88.46 | 8.80 | 15 | CNN | Yes |
| STU-Net L [20] | 83.36 | 91.31 | 80.31 | 91.26 | 85.84 | 89.34 | 26.50 | 51 | CNN | Yes |
| SwinUNETR [32] | 78.89 | 91.29 | 76.50 | 90.68 | 81.27 | 83.81 | 13.10 | 15 | TF | Yes |
| SwinUNETRV2 [17] | 80.85 | 92.01 | 77.85 | 90.74 | 84.14 | 86.24 | 13.40 | 15 | TF | Yes |
| nnFormer [41] | 80.86 | 92.40 | 77.40 | 90.22 | 75.85 | 81.55 | 5.70 | 8 | TF | Yes |
| CoTr [37] | 81.95 | 90.56 | 79.10 | 90.73 | 84.59 | 88.02 | 8.20 | 18 | TF | Yes |
| No-Mamba Base | 83.69 | 91.89 | 80.57 | 91.26 | 85.98 | 89.04 | 12.00 | 24 | CNN | Yes |
| U-Mamba Bot [26] | 83.51 | 91.79 | 80.40 | 91.26 | 86.22 | 89.13 | 12.40 | 24 | Mam | Yes |
| U-Mamba Enc [26] | 82.41 | 91.22 | 80.27 | 90.91 | 86.34 | 88.38 | 24.90 | 47 | Mam | Yes |
| A3DS SegResNet [1, 28] | 80.69 | 90.69 | 79.28 | 90.79 | 81.11 | 87.27 | 20.00 | 22 | CNN | No |
| A3DS DiNTS [1, 18] | 78.18 | 82.97 | 69.05 | 87.75 | 65.28 | 82.35 | 29.20 | 16 | CNN | No |
| A3DS SwinUNETR [1, 32] | 76.54 | 82.68 | 68.59 | 89.90 | 52.82 | 85.05 | 34.50 | 9 | TF | No |

CNN-based U-Nets yield the best performance. Table 1 shows our experimental results (see Appendix Table 4 for results measured as NSD). CNN-based U-Nets implemented in nnU-Net consistently deliver strong performance across all six datasets. Besides the original nnU-Net, this includes STU-Net, ResEnc M/L/XL, MedNeXt, and No-Mamba Base. MedNeXt consistently stands out with the best performance on all datasets except KiTS, although the gaps are smaller on the datasets with higher benchmarking suitability. Furthermore, MedNeXt’s performance gains come at a substantial cost in training time (especially k5). Additional experiments in Appendix Table 5 indicate that part of MedNeXt’s advantage can be explained by target spacing selection and is thus not exclusively linked to a superior architecture. Given that STU-Net was primarily introduced with a focus on transfer learning, we analysed the effect of pre-training on the TotalSegmentator dataset in Appendix Table 6 [35]. In contrast to prior claims, Transformer-based architectures (SwinUNETR, nnFormer, CoTr) fail to match the performance of CNNs. This includes not matching the performance of the original nnU-Net, which was released long before the Transformer-based architectures. CoTr shows the best results in the Transformer category, which prior literature attributes to its convolutional components [30]. U-Mamba initially appears to perform well across segmentation tasks, but comparison against the previously missing baseline "No-Mamba Base" reveals that the Mamba layers actually have no effect on performance; instead, the originally reported gains were due to coupling the method with a residual U-Net (see P1). The fact that SegResNet shows the best performance among the methods implemented in Auto3DSeg underscores that the observed superiority of CNNs is not merely a bias introduced by nnU-Net.

nnU-Net is the state-of-the-art segmentation framework. We find that none of the three methods featured in Auto3DSeg reaches the original nnU-Net baseline ("org.") performance, indicating a substantial disadvantage due to the underlying Auto3DSeg framework. This negative gap occurs despite the significantly lower VRAM usage and training time of the nnU-Net baseline. When comparing the two frameworks with an identical method (SwinUNETR), nnU-Net wins on 5 out of 6 datasets. Following an official Auto3DSeg tutorial [2], we improved the results via manual changes to the configuration and by further increasing its compute budget, but failed to reach competitive performance (see Appendix Table 2). Taken together, while Auto3DSeg can be pushed to produce state-of-the-art results, as evidenced by its recent challenge wins, its out-of-the-box capabilities do not match nnU-Net.

Scaling models is important, especially on larger datasets. We tested the effect of model scaling based on two methods: nnU-Net ResEnc M/L/XL and STU-Net S/B/L. We find that on the more challenging tasks AMOS and KiTS, a significant boost in performance is observed as the compute budget increases. As expected, the "easier" tasks BTCV and BraTS bear less potential for performance gains from model scaling. These findings underscore the importance of size-awareness and dataset-awareness for meaningful method comparison. For instance, evidence for the superiority of a large new segmentation model should not be based on comparison against a much smaller original nnU-Net.

5 Conclusion

Our benchmark reveals a concerning trend in 3D medical image segmentation: most methods introduced in recent years fail to surpass the original nnU-Net baseline introduced in 2018. This raises the question: How can we steer the field towards genuine progress? In this study, we link the observed shortcomings to a widespread lack of rigor in method validation. To counteract this, we introduce: 1) A systematic collection of validation pitfalls along with recommendations for their avoidance, 2) The release of updated standardized baselines facilitating meaningful method validation, 3) A strategy for measuring the suitability of datasets for method benchmarking.

Beyond these contributions, achieving true and lasting progress in the field requires a cultural shift, where the quality of validation is valued as much as the novelty of network architectures. Making this shift happen will be the responsibility of method developers, users, and reviewers alike.

6 Acknowledgements

This work was partly funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.

References

  • [1] Auto3DSeg. LINK. Accessed: 2024-01-25.
  • [2] Auto3DSeg KiTS23 tutorial. LINK. Accessed: 2024-03-05.
  • [3] SwinUNETR comment on additional training data. https://github.com/Project-MONAI/research-contributions/issues/68. Accessed: 2024-01-25.
  • [4] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, et al. The medical segmentation decathlon. Nature Communications, 2022.
  • [5] U. Baid, S. Ghodasara, S. Mohan, M. Bilello, et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314, 2021.
  • [6] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, et al. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data, 2017.
  • [7] O. Bernard, A. Lalande, C. Zotti, Cervenansky, et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE TMI, 2018.
  • [8] P. Bilic, P. Christ, H. B. Li, E. Vorontsov, et al. The Liver Tumor Segmentation Benchmark (LiTS). Medical Image Analysis, 2023.
  • [9] H. Cao, Y. Wang, J. Chen, D. Jiang, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In ECCV, 2022.
  • [10] M. J. Cardoso, W. Li, R. Brown, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022.
  • [11] J. Chen, Y. Lu, Q. Yu, X. Luo, et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
  • [12] Y. Gao, M. Zhou, and D. N. Metaxas. UTNet: A hybrid transformer architecture for medical image segmentation. In MICCAI, 2021.
  • [13] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • [14] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, et al. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In International MICCAI Brainlesion Workshop, 2021.
  • [15] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, et al. UNETR: Transformers for 3D medical image segmentation. In WACV, 2022.
  • [16] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, et al. UNETR: Transformers for 3D medical image segmentation. In WACV, 2022.
  • [17] Y. He, V. Nath, D. Yang, Y. Tang, et al. SwinUNETR-V2: Stronger swin transformers with stagewise convolutions for 3D medical image segmentation. In MICCAI, 2023.
  • [18] Y. He, D. Yang, H. Roth, C. Zhao, and D. Xu. DiNTS: Differentiable neural network topology search for 3D medical image segmentation. In WACV, 2021.
  • [19] N. Heller, F. Isensee, D. Trofimova, R. Tejpaul, et al. The KiTS21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT, 2023.
  • [20] Z. Huang, H. Wang, Z. Deng, J. Ye, et al. STU-Net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training. arXiv preprint arXiv:2304.06716, 2023.
  • [21] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
  • [22] F. Isensee and K. H. Maier-Hein. An attempt at beating the 3D U-Net. arXiv preprint arXiv:1908.02182, 2019.
  • [23] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert, et al. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. arXiv preprint arXiv:1809.10486, 2018.
  • [24] Y. Ji, H. Bai, C. Ge, J. Yang, et al. AMOS: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. Advances in Neural Information Processing Systems, 2022.
  • [25] B. Landman, Z. Xu, J. E. Igelsias, M. Styner, et al. 2015 MICCAI multi-atlas labeling beyond the cranial vault workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, 2015.
  • [26] J. Ma, F. Li, and B. Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
  • [27] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE TMI, 2014.
  • [28] A. Myronenko. 3D MRI brain tumor segmentation using autoencoder regularization. In BrainLes 2018, held in conjunction with MICCAI, 2019.
  • [29] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [30] S. Roy, G. Koehler, M. Baumgartner, C. Ulrich, J. Petersen, F. Isensee, and K. Maier-Hein. Transformer utilization in medical image segmentation networks. arXiv preprint arXiv:2304.04225, 2023.
  • [31] S. Roy, G. Koehler, C. Ulrich, M. Baumgartner, et al. MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation. In MICCAI, 2023.
  • [32] Y. Tang, D. Yang, W. Li, H. R. Roth, et al. Self-supervised pre-training of swin transformers for 3D medical image analysis. In CVPR, 2022.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, et al. Attention is all you need. NeurIPS, 2017.
  • [34] W. Wang, C. Chen, M. Ding, J. Li, et al. TransBTS: Multimodal brain tumor segmentation using transformer. In MICCAI, 2021.
  • [35] J. Wasserthal, H.-C. Breit, M. Meyer, M. Pradella, D. Hinck, A. W. Sauter, T. Heye, D. T. Boll, J. Cyriac, S. Yang, M. Bach, and M. Segeroth. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence, 2023.
  • [36] Y. Wu, K. Liao, J. Chen, J. Wang, et al. D-Former: A U-shaped dilated transformer for 3D medical image segmentation. Neural Computing and Applications, 2023.
  • [37] Y. Xie, J. Zhang, C. Shen, and Y. Xia. CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation. 2021.
  • [38] Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu. SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. arXiv preprint arXiv:2401.13560, 2024.
  • [39] G. Xu, X. Zhang, X. He, and X. Wu. LeViT-UNet: Make faster encoders with transformer for medical image segmentation. In PRCV, 2023.
  • [40] Y. Zhang, H. Liu, and Q. Hu. TransFuse: Fusing transformers and CNNs for medical image segmentation. In MICCAI, 2021.
  • [41] H.-Y. Zhou, J. Guo, Y. Zhang, L. Yu, et al. nnFormer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201, 2021.

Appendix Table 2: Effect of manual configuration changes and increased compute budget for Auto3DSeg SegResNet on KiTS2023 (fold 0), compared to nnU-Net configurations.

| Model | GPU hours | VRAM (GPUs × MB) | Epochs | Batch Size | Patch Size | Spacing | KiTS Fold 0 DSC [%] |
|---|---|---|---|---|---|---|---|
| nnU-Net (org.) | 8.88 | 1 × 6901 | – | 2 | 128×128×128 | 1×0.78×0.78 | 86.25 |
| nnU-Net ResEnc M | 11.39 | 1 × 8805 | – | 2 | 128×128×128 | 1×0.78×0.78 | 87.91 |
| nnU-Net ResEnc L | 35.28 | 1 × 24223 | – | 2 | 160×224×192 | 1×0.78×0.78 | 88.60 |
| A3DS SegResNet | 39.72 | 1 × 20267 | 300 | 2 | 144×224×224 | 1×0.78×0.78 | 83.73 |
| A3DS SegResNet | 61.28 | 8 × 20267 | 300 | 8 × 2 | 144×224×224 | 1×0.78×0.78 | 76.81 |
| A3DS SegResNet | 136.64 | 8 × 20267 | 600 | 8 × 2 | 144×224×224 | 0.78×0.78×0.78 | 85.60 |
| A3DS SegResNet | 247.44 | 8 × 39873 | 900 | 8 × 2 | 224×256×256 | 0.78×0.78×0.78 | 87.77 |

Appendix Table 3: Intra-method standard deviation (SD) of DSC scores across the five folds per dataset and method, followed by averaged intra-method SD, inter-method SD, and the inter/intra ratio (with and without the Auto3DSeg methods).

| Method | BTCV | ACDC | LiTS | BraTS2021 | KiTS2023 | AMOS2022 |
|---|---|---|---|---|---|---|
| nnU-Net (org.) | 2.6% | 0.8% | 3.5% | 0.62% | 2.0% | 0.43% |
| nnU-Net ResEnc M | 2.4% | 0.62% | 2.6% | 0.67% | 2.2% | 0.57% |
| nnU-Net ResEnc L | 2.7% | 0.6% | 2.4% | 0.57% | 1.3% | 0.59% |
| nnU-Net ResEnc XL | 2.7% | 0.51% | 2.4% | 0.62% | 1.2% | 0.43% |
| MedNeXt L k3 | 2.1% | 0.26% | 2.3% | 0.66% | 0.94% | 0.43% |
| MedNeXt L k5 | 2.0% | 0.2% | 2.4% | 0.59% | 1.2% | 0.43% |
| STU-Net S | 2.2% | 0.6% | 3.3% | 0.72% | 1.7% | 0.42% |
| STU-Net B | 2.3% | 0.78% | 3.6% | 1.0% | 1.9% | 0.52% |
| STU-Net L | 2.6% | 0.85% | 2.4% | 0.62% | 2.1% | 0.45% |
| SwinUNETR | 2.7% | 0.65% | 3.1% | 0.75% | 2.0% | 0.44% |
| SwinUNETRV2 | 2.1% | 0.51% | 2.8% | 0.55% | 1.7% | 0.56% |
| nnFormer | 2.1% | 0.21% | 2.3% | 0.52% | 4.2% | 0.5% |
| CoTr | 2.8% | 0.83% | 2.8% | 0.69% | 1.4% | 0.64% |
| No-Mamba Base | 1.9% | 0.51% | 2.9% | 0.55% | 2.1% | 0.32% |
| U-Mamba Bot | 2.3% | 0.59% | 2.1% | 0.71% | 2.7% | 0.43% |
| U-Mamba Enc | 2.3% | 0.47% | 1.7% | 0.64% | 2.2% | 0.5% |
| A3DS SegResNet | 3.0% | 0.33% | 2.7% | 0.52% | 1.7% | 0.48% |
| A3DS DiNTS | 3.0% | 2.2% | 2.5% | 0.79% | 5.3% | 1.3% |
| A3DS SwinUNETR | 1.8% | 3.6% | 6.6% | 0.69% | 1.5% | 0.64% |
| Averages | | | | | | |
| Intra-method SD | 2.39% | 0.79% | 2.89% | 0.66% | 2.07% | 0.53% |
| Inter-method SD | 2.24% | 2.83% | 3.80% | 0.84% | 9.03% | 2.52% |
| Inter/Intra ratio | 94% | 357% | 132% | 127% | 435% | 474% |
| Averages w/o A3DS | | | | | | |
| Intra-method SD | 2.35% | 0.56% | 2.66% | 0.66% | 1.93% | 0.48% |
| Inter-method SD | 1.52% | 0.57% | 1.68% | 0.35% | 3.14% | 2.28% |
| Inter/Intra ratio | 65% | 102% | 63% | 53% | 163% | 477% |

Appendix Table 4: Benchmark results measured as Normalized Surface Distance (NSD), averaged over 5-fold cross-validation.

| Architecture | LiTS | BTCV | ACDC | BraTS2021 | KiTS2023 | AMOS2022 |
|---|---|---|---|---|---|---|
| nnU-Net (org.) | 78.26 | 85.53 | 94.93 | 93.64 | 82.91 | 91.49 |
| nnU-Net ResEnc M | 79.96 | 86.01 | 95.50 | 93.71 | 84.10 | 91.72 |
| nnU-Net ResEnc L | 80.39 | 86.08 | 95.11 | 93.59 | 85.93 | 92.35 |
| nnU-Net ResEnc XL | 79.64 | 85.89 | 94.90 | 93.61 | 86.49 | 92.64 |
| MedNeXt L k3 | 81.07 | 87.78 | 96.07 | 93.85 | 86.29 | 92.72 |
| MedNeXt L k5 | 81.26 | 88.18 | 96.09 | 94.04 | 85.67 | 92.86 |
| STU-Net S | 76.20 | 85.13 | 94.27 | 93.26 | 81.08 | 90.81 |
| STU-Net B | 77.33 | 85.30 | 94.59 | 93.54 | 83.08 | 91.28 |
| STU-Net L | 78.85 | 85.81 | 95.12 | 93.66 | 83.02 | 92.30 |
| SwinUNETR | 73.06 | 79.79 | 94.12 | 93.16 | 75.91 | 85.13 |
| SwinUNETRV2 | 75.38 | 82.52 | 95.15 | 93.15 | 80.11 | 88.47 |
| nnFormer | 74.66 | 82.29 | 95.83 | 93.22 | 69.43 | 82.93 |
| CoTr | 77.25 | 84.10 | 93.74 | 93.49 | 80.92 | 90.75 |
| No-Mamba Base | 78.88 | 86.14 | 95.26 | 93.64 | 83.56 | 92.08 |
| U-Mamba Bot | 78.91 | 86.40 | 95.40 | 93.65 | 83.27 | 92.00 |
| U-Mamba Enc | 78.60 | 84.60 | 94.33 | 93.21 | 83.64 | 91.25 |
| A3DS SegResNet | 76.46 | 82.01 | 93.88 | 93.40 | 75.61 | 89.85 |
| A3DS DiNTS | 62.49 | 77.30 | 83.67 | 90.36 | 58.74 | 82.75 |
| A3DS SwinUNETR | 61.16 | 74.59 | 83.94 | 92.00 | 46.37 | 86.93 |

Appendix Table 5: Effect of target spacing: nnU-Net ResEnc L with default vs. isotropic (iso, 1×1×1) spacing, compared to MedNeXt L k3 (DSC [%]).

| Dataset | Method | Patch Size | Spacing | Batch Size | DSC |
|---|---|---|---|---|---|
| BTCV | nnU-Net ResEnc L | 80×256×256 | 3×0.76×0.76 | 2 | 83.35 |
| BTCV | nnU-Net ResEnc L (iso) | 192×192×192 | 1×1×1 | 2 | 84.01 |
| BTCV | MedNeXt L k3 | 128×128×128 | 1×1×1 | 2 | 84.70 |
| ACDC | nnU-Net ResEnc L | 20×256×224 | 5×1.56×1.56 | 10 | 91.69 |
| ACDC | nnU-Net ResEnc L (iso) | 96×256×256 | 1×1×1 | 3 | 92.64 |
| ACDC | MedNeXt L k3 | 128×128×128 | 1×1×1 | 2 | 92.65 |
| AMOS | nnU-Net ResEnc L | 96×224×224 | 2×0.71×0.71 | 2 | 89.40 |
| AMOS | nnU-Net ResEnc L (iso) | 192×192×192 | 1×1×1 | 2 | 89.60 |
| AMOS | MedNeXt L k3 | 128×128×128 | 1×1×1 | 2 | 89.62 |

Appendix Table 6: Effect of pre-training STU-Net L on the TotalSegmentator dataset [35] (mean DSC [%], 5-fold cross-validation).

| Method | BTCV (n=30) | ACDC (n=200) | LiTS (n=131) | BraTS (n=1251) | KiTS (n=489) | AMOS (n=360) | VRAM [GB] | RT [h] | Arch. | nnU |
|---|---|---|---|---|---|---|---|---|---|---|
| STU-Net L [20] | 83.36 | 91.31 | 80.31 | 91.26 | 85.84 | 89.34 | 26.50 | 51 | CNN | Yes |
| STU-Net L pretrained [20] | 84.28 | 91.53 | 81.57 | – | 88.32 | 89.46 | 26.50 | 51* | CNN | Yes |

* Fine-tuning runtime only. Pre-training takes about 4 times longer.
