Poster
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi · Wenxuan Li · Yucheng Tang · Fabian Isensee · Zifu Wang · Jieneng Chen · Yu-Cheng Chou · Tassilo Wald · Constantin Ulrich · Michael Baumgartner · Saikat Roy · Klaus Maier-Hein · Paul Jaeger · Yiwen Ye · Yutong Xie · Jianpeng Zhang · Ziyang Chen · Yong Xia · Yannick Kirchhoff · Maximilian R. Rokuss · Pengcheng Shi · Ting Ma · Yuxin Du · Fan BAI · Tiejun Huang · Bo Zhao · Zhaohu Xing · Lei Zhu · Saumya Gupta · Haonan Wang · Xiaomeng Li · Ziyan Huang · Jin Ye · Junjun He · Yousef Sadegheih · Afshin Bozorgpour · Pratibha Kumari · Reza Azad · Dorit Merhof · Hanxue Gu · Haoyu Dong · Jichen Yang · Maciej Mazurowski · Linshan Wu · Jia-Xin Zhuang · Hao CHEN · Holger Roth · Daguang Xu · Matthew Blaschko · Sergio Decherchi · Andrea Cavalli · Alan Yuille · Zongwei Zhou
West Ballroom A-D #5206
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have underlying problems such as small, in-distribution test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. We address this misalignment with Touchstone, a large-scale collaborative benchmark for medical segmentation. This benchmark is based on annotated CT datasets of unprecedented scale: 5,195 training volumes from 76 medical institutions around the world, and 6,933 testing volumes from 8 additional hospitals. This extensive and diverse test set not only makes the benchmark results more statistically meaningful than existing ones, but also systematically tests AI algorithms in varied out-of-distribution scenarios. We invited 14 inventors of various AI algorithms, categorized as CNN, Transformer, and their combinations, to train their algorithms on the publicly available training set. Our team, as a third party, independently evaluated these algorithms on the test set and reported their pros and cons from multiple perspectives. We also evaluated publicly available AI frameworks (which are more flexible and can support different algorithms), including MONAI and its Auto3DSeg from NVIDIA, nnU-Net from DKFZ, and numerous other open-source repositories, such as a vision-language framework developed by researchers. We are committed to expanding this benchmark to encourage more innovation in AI algorithms for the medical domain.