A Simple Fully Automated Method for Skull-Stripping Quality Control in Brain MR Image Processing Pipelines Evaluated Using Multicenter Data
Till Huelnhagen1,2,3, Ricardo Corredor-Jerez1,2,3, Claudia Bigoni1, Veronica Ravano1, Mário João Fartaria1,2,3, Adrian Tsang4, Rodrigo D. Perea4, Sara Makaretz4, Maria Laura Blefari4, Yuchuan Zhuang4, Bénédicte Maréchal1,2,3, Elizabeth Fisher4, and Tobias Kober1,2,3

1Advanced Clinical Imaging Technology, Siemens Healthcare AG, Lausanne, Switzerland, 2Department of Radiology, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland, 3Signal Processing Laboratory (LTS 5), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, 4Biogen, Cambridge, MA, United States


Automated brain segmentation approaches are increasingly being used for decision support in routine clinical settings. While segmentation may be considered a “solved problem” in research, it remains challenging to assure reliable performance of automated tools in clinical settings, which is a crucial requirement for diagnostic tools. Automated quality control procedures are therefore vital to ensure correct results, but they are often difficult to implement or time-consuming to run. We propose a simple, fast and fully automated method to detect segmentation errors, and we evaluate its performance in detecting skull-stripping errors using the results of two different brain segmentation algorithms on a large multicenter dataset. Results show that the method detects skull-stripping errors with high specificity.


Automated brain segmentation approaches are increasingly being used in routine clinical settings, in particular for neurodegenerative diseases like multiple sclerosis (MS) and Alzheimer’s disease, for which brain atrophy is a marker of disease progression1,2. Reliability of segmentation results is essential for clinical applicability, but depends on many factors. Automated segmentation algorithms will never be perfect; the resulting uncertainty can be managed, however, as long as segmentation errors are detected automatically. Segmentation quality metrics can be used for this purpose, but they require a ground truth, which is rarely available. Registration of images to reference templates is a viable option to overcome this limitation3, but the processing can be time-consuming, and results can be difficult to interpret. The goal of this work was to develop a simple and fast, fully automated pipeline for detecting errors in brain segmentation. The performance of the approach in detecting skull-stripping errors, a common source of unsatisfactory results, was evaluated on a large multicenter dataset including images from healthy subjects and MS patients, and was compared to human reader scores.


The workflow is illustrated in Figure 1. The skull-stripped input image is first registered to a reference image in a template space. Here, the ICBM 2009c atlas4 was used as reference, and full affine transforms were employed5,6. Affine registration was deliberately chosen to prevent the registration from “correcting” the segmentation errors that are to be detected. Using the obtained transform, a binary mask of the image is transformed into the template space, where it is compared to a reference mask by means of similarity metrics. Figure 2 shows the procedure using an example case. Dice coefficient and Hausdorff distance (HDD) were employed because of their complementary sensitivity to segmentation errors; in principle, other similarity metrics could also be applied, depending on the use case. Test data were used to define cutoff values for both metrics to classify segmentations as good or bad. The intended use of the pipeline was to reject failed segmentations prior to visual review. Thresholds were therefore selected to maximize specificity, i.e. to minimize false positives (acceptable segmentations wrongly flagged as failed).
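As a rough illustration of the comparison step, the following Python sketch computes both metrics for two binary masks that are assumed to be in template space already. It is not the implementation used in this work: masks are represented as plain sets of voxel coordinates, the Hausdorff distance is taken over all mask voxels by brute force, and the function names and toy masks are hypothetical.

```python
from itertools import product
from math import dist


def dice(a, b):
    """Dice coefficient of two voxel sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def directed_hausdorff(a, b):
    """Largest distance from any voxel in a to its nearest voxel in b."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance (in voxel units) between two voxel sets."""
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def box(lo, hi):
    """Voxel coordinates of the cube [lo, hi) along each of the three axes."""
    return set(product(range(lo, hi), repeat=3))


# Hypothetical toy masks: a 4x4x4 reference and a candidate that matches it
# except for one stray voxel, standing in for a small piece of residual skull.
reference = box(0, 4)
candidate = box(0, 4) | {(7, 0, 0)}

print(f"Dice: {dice(reference, candidate):.4f}")        # 128/129, close to 1
print(f"Hausdorff: {hausdorff(reference, candidate)}")  # 4.0 voxels
```

In this toy case the Dice coefficient stays close to 1 while the Hausdorff distance jumps to 4 voxels, which is the kind of complementary behaviour that motivates using both metrics. For full-size volumes a brute-force search like this would be far too slow, and an array-based or distance-transform implementation would be the practical choice.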

To evaluate the performance of the pipeline, 364 3D MP-RAGE (TR = 2300 ms, TI = 900 ms, matrix size = 240×256×176, voxel size = 1×1×1 mm³) and 3D FLAIR scans (TR = 5000 ms, TI = 1800 ms, matrix size = 240×256×176, voxel size = 1×1×1 mm³) were acquired from 181 subjects (146 MS patients, 35 healthy controls) at ten institutions using different 3T scanners (MAGNETOM Verio, Skyra, Prisma, Prismafit, Trio, Vida and BIOGRAPH mMR, all Siemens Healthcare, Erlangen, Germany), covering a wide variety of brain sizes and anatomies. The standardized images were collected as part of the Multiple Sclerosis Partners Advancing Technology and Healthcare Solutions (MS PATHS) project7.

Brain segmentation was performed using two different approaches: a FLAIR-based algorithm (a modified version of autosegMS8,9) and an MP-RAGE-based prototype method (MorphoBox10,11). Both methods provided a skull-stripped image (autosegMS: brain outer contour volume, OCV; MorphoBox: total intracranial volume, TIV). A corresponding reference template was used for each method. For both methods, skull-stripping and brain segmentation quality were visually reviewed and rated as follows:

  • GOOD: no or minor segmentation errors
  • MODERATE: some segmentation errors
  • BAD: unacceptable


Figure 3 shows the distribution of quality metrics with respect to the visual assessment scores. For autosegMS, cutoff values were set to 19 for HDD and 0.915 for Dice; for MorphoBox, they were set to 20.5 for HDD and 0.9275 for Dice. For autosegMS, the proposed method detected “BAD” segmentations with a sensitivity of 40% and a specificity of 97.7% (Figure 3A). For the MorphoBox segmentations, sensitivity was 16% and specificity 99.4% (Figure 3B). It should be noted that the manual scoring also rated brain segmentation errors, not just skull-stripping errors, which explains the rather low sensitivity. When only those “BAD” MorphoBox results that were also flagged as “TIV_too_big” were counted as bad, sensitivity was 58.3% and specificity 99.4% (Figure 3C). Figure 4 illustrates examples of skull-stripping errors together with the corresponding metric values. Processing time was approximately 30 seconds per case on a modern desktop computer.
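How cutoff-based flagging translates into sensitivity and specificity can be sketched as follows. The default cutoffs are the autosegMS values quoted above; everything else is an assumption made for illustration: that a case is flagged when either metric violates its cutoff, that only “BAD” ratings count as positives, and the toy cases themselves, which are invented and are not study data.

```python
def flags_error(dice, hdd, dice_cutoff=0.915, hdd_cutoff=19.0):
    """Flag a case when either metric violates its cutoff.

    Combining the two metrics with a logical OR is an assumption of this
    sketch; the defaults are the autosegMS cutoffs quoted in the text.
    """
    return dice < dice_cutoff or hdd > hdd_cutoff


def sensitivity_specificity(cases, **cutoffs):
    """Compute (sensitivity, specificity), treating "BAD" as the positive class.

    Each case is a (dice, hdd, visual_rating) tuple. Every rating other than
    "BAD" is counted as negative, which is also an assumption of this sketch.
    """
    tp = fn = tn = fp = 0
    for dice, hdd, rating in cases:
        flagged = flags_error(dice, hdd, **cutoffs)
        if rating == "BAD":
            tp += flagged       # bad case that was flagged
            fn += not flagged   # bad case that was missed
        else:
            fp += flagged       # acceptable case wrongly flagged
            tn += not flagged   # acceptable case correctly passed
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy cases (NOT study data): (Dice, HDD, visual rating).
toy_cases = [
    (0.95, 10.0, "GOOD"),
    (0.93, 15.0, "GOOD"),
    (0.94, 12.0, "MODERATE"),
    (0.92, 25.0, "GOOD"),  # large HDD on an acceptable case: false positive
    (0.85, 30.0, "BAD"),   # skull-stripping failure caught by both metrics
    (0.96, 11.0, "BAD"),   # rated bad for another reason, mask looks fine: missed
]

sens, spec = sensitivity_specificity(toy_cases)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")  # 0.50 and 0.75
```

The last toy case mirrors the situation described above: a segmentation rated “BAD” for reasons other than skull-stripping leaves the mask metrics unchanged and therefore lowers the measured sensitivity without affecting specificity.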

Discussion and Conclusions

We have proposed a simple and fast method for fully automated detection of skull-stripping errors. Owing to its short processing time, the method can easily be integrated into clinical workflows and is feasible for use with large datasets. In the current implementation, the method is limited to skull-stripping errors; other segmentation errors could be detected by using additional template masks, e.g. of the brain and its sub-structures (depending on the segmentation algorithm), which would also increase sensitivity if needed. This work deliberately prioritizes high specificity, so that bad skull-stripping is detected without accidentally excluding good cases, a requirement for clinical use of such a method. The proposed method is not limited to brain MRI scans, but can also be applied to other body parts and imaging modalities, as long as corresponding reference templates can be provided. To conclude, the proposed method provides a simple and fast way to detect skull-stripping errors and can easily be adapted to other segmentation problems.




1. Storelli L, Rocca MA, Pagani E, et al. Measurement of Whole-Brain and Gray Matter Atrophy in Multiple Sclerosis: Assessment with MR Imaging. Radiology 2018:172468 doi: 10.1148/radiol.2018172468

2. Ledig C, Schuh A, Guerrero R, Heckemann RA, Rueckert D. Structural brain imaging in Alzheimer’s disease and mild cognitive impairment: biomarker analysis and shared morphometry database. Sci. Rep. 2018;8:11258 doi: 10.1038/s41598-018-29295-9

3. Alfaro-Almagro F, Jenkinson M, Bangerter NK, et al. Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage 2018 doi: 10.1016/j.neuroimage.2017.10.034

4. Fonov V, Evans A, McKinstry R, Almli C, Collins D. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. Neuroimage 2009 doi: 10.1016/S1053-8119(09)70884-5

5. Klein S, Staring M, Murphy K, Viergever MA, Pluim JPW. elastix: a toolbox for intensity-based medical image registration. IEEE Trans. Med. Imaging 2010;29:196–205 doi: 10.1109/TMI.2009.2035616

6. Shamonin DP, Bron EE, Lelieveldt BPF, Smits M, Klein S, Staring M. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer’s disease. Front. Neuroinform. 2014;7:50

7. Multiple Sclerosis Partners Advancing Technology and Healthcare Solutions (MS PATHS). https://www.mspaths.com

8. Fisher E, Cothren RM, Tkach JA, Masaryk TJ, Cornhill JF. Knowledge-based 3D segmentation of the brain in MR images for quantitative multiple sclerosis lesion tracking. In: SPIE Proc. Medical Imaging: Image Processing; 1997. pp. 19–25

9. Rudick RA, Fisher E, Lee JC, Simon J, Jacobs L. Use of the brain parenchymal fraction to measure whole brain atrophy in relapsing-remitting MS. Multiple Sclerosis Collaborative Research Group. Neurology 1999;53:1698–1704

10. Schmitter D, Roche A, Maréchal B, et al. An evaluation of volume-based morphometry for prediction of mild cognitive impairment and Alzheimer’s disease. NeuroImage Clin. 2015;7:7–17 doi: 10.1016/j.nicl.2014.11.001

11. Roche A, Forbes F. Partial Volume Estimation in Brain MRI Revisited. In: Golland P, Hata N, Barillot C, Hornegger J, Howe R, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014: 17th International Conference, Boston, MA, USA, September 14-18, 2014, Proceedings, Part I. Cham: Springer International Publishing; 2014. pp. 771–778. doi: 10.1007/978-3-319-10404-1_96


Figure 1: Processing workflow: An input image is processed by a skull stripping algorithm providing the TIV (total intracranial volume) image and corresponding mask. The TIV image is then registered to a reference space. The resulting transform is used to transform the TIV mask into the reference space. The transformed mask is compared to a reference mask in template space by means of similarity metrics. Thresholds for the quality metrics are applied to detect skull stripping errors.

Figure 2: Example of a TIV mask in image and template space. Original MPRAGE image with TIV mask overlay (top), TIV mask registered to template space overlaid on the template brain (top), reference TIV mask in template space (bottom).

Figure 3: Distribution of Hausdorff Distance and Dice coefficient related to manual visual assessment scores. autosegMS (A), MorphoBox (B), MorphoBox, but only considering cases as bad if they are also marked as having too large TIV (C). The dashed red lines indicate the cutoff values used for the classification of skull stripping errors.

Figure 4: Examples of good (top) and erroneous skull stripping (center, bottom) and corresponding similarity metric values. Large local deviations of the mask are detected by the Hausdorff distance (center) while less pronounced but global errors are more reflected in the Dice coefficient (bottom).

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)