Repeatability and Reproducibility of brain volume measurements with SPM and FreeSurfer and their impact on subtle between-group differences
Letizia Palumbo1, Paolo Bosco1, Elisa Ferrari2, Piernicola Oliva3, Giovanna Spera1, and Alessandra Retico1

1National Institute for Nuclear Physics (INFN), Pisa, Italy, 2Scuola Normale Superiore, Pisa, Italy, 3University of Sassari and INFN Cagliari Division, Sassari, Italy


The main aim of the study is to investigate whether the choice of processing method has a relevant influence on the results of a neuroimaging study. We evaluated the intra-method repeatability and the inter-method reproducibility of two widely used automatic segmentation methods for brain MRI: the FreeSurfer (FS) and Statistical Parametric Mapping (SPM) software packages. We segmented the gray matter, the white matter and the subcortical structures in test-retest MRI data of healthy volunteers from two publicly available datasets. High intra-method repeatability was found for both SPM and FS, but SPM was more consistent than FS in measuring ROI volumes.


Segmenting an MR brain image into different brain structures is a widely used pre-processing step in neuroscience research. To investigate large data sets, several automated segmentation tools are commonly used to segment the brain regions in a reasonable amount of time. However, the reliability of their measurements is still a matter of debate, and a lack of inter-method agreement can produce inconsistent results across neuroimaging studies. We analyzed the intra-method repeatability and the inter-method reproducibility of two processing methods, SPM12 [1] and FS [2], and looked for systematic biases in their volume estimates. We focused on six regions of interest (ROIs): two global measures, gray matter (GM) and white matter (WM), and four subcortical structures (hippocampus, putamen, caudate and brainstem). Assessing the reliability of each method and quantifying the discrepancies between them is important for the correct interpretation of the results of longitudinal studies and for quantitative considerations in meta-analyses. In addition, we tested whether the choice of segmentation method affects the results of group comparisons of volumes, by assessing the differences in brain measures between male and female subgroups.


We examined two publicly available data samples: the Kirby-21 (Kirby) dataset [3] and the OASIS dataset [4]. The first consists of 3D T1-weighted images of 21 healthy volunteers (11 males and 10 females; age: 32 ± 9 years) acquired using a 3 T MRI scanner. The OASIS subset considered here consists of 3D T1-weighted images of 20 healthy volunteers (10 males and 10 females; age: 23.4 ± 3.9 years) acquired using a 1.5 T MRI scanner. To enable reproducibility studies, the MRI images were acquired twice with the same acquisition parameters, after a short time for the Kirby dataset and with a delay in the range of 1-89 days for the OASIS dataset. First, we quantified the intra-method repeatability of SPM and FS in estimating brain tissue volumes through a test-retest analysis. Second, we evaluated the inter-method reproducibility of the estimated volumes by comparing the two processing methods and quantifying the overlap of the segmented regions. Finally, we compared the brain volume measures obtained with each software package for the male and female subsamples, in order to assess gender-related volume differences. These analyses were conducted in parallel for the two data samples using Pearson’s correlation, Bland-Altman plots, Cohen’s d effect size and the Dice similarity index.
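As an illustration of the agreement metrics named above, the following is a minimal Python sketch of Pearson's correlation and of the Bland-Altman parameters for a set of paired volume measurements. It is not the authors' code: the function names and the sample volumes are hypothetical, the percent difference follows the "difference over average" definition given in the Table 1 caption, and the 1.96 multiplier for the 95% limits of agreement is an assumption (the usual normal-theory choice).

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman_percent(a, b):
    """Mean, SD and 95% limits of agreement of the paired percent difference.

    Each difference is (a - b) divided by the average of the pair, times 100.
    """
    diffs = [100.0 * (u - v) / ((u + v) / 2.0) for u, v in zip(a, b)]
    d = statistics.fmean(diffs)
    s = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    # Assumption: limits of agreement taken as d +/- 1.96 * s.
    return d, s, (d - 1.96 * s, d + 1.96 * s)


# Hypothetical test and retest volumes in ml (illustrative values only).
test = [650.0, 700.0, 720.0, 610.0, 680.0]
retest = [648.0, 703.0, 715.0, 612.0, 684.0]

r = pearson_r(test, retest)
d, s, (lower, upper) = bland_altman_percent(test, retest)
print(f"r = {r:.3f}, d = {d:.2f}%, s = {s:.2f}%, LoA = [{lower:.2f}%, {upper:.2f}%]")
```

The same two functions would apply to the inter-method comparison by passing the SPM volumes as `a` and the FS volumes as `b`. A mean percent difference whose limits of agreement exclude zero would point to a systematic bias of the kind a Bland-Altman plot is meant to reveal.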


We found high Pearson’s correlation (0.98-0.99) between the volumes obtained on test and retest MRI data analyzed with both SPM and FS. The Bland-Altman plots revealed systematic biases between the test and retest measures of the GM and WM volumes obtained by FS on the OASIS dataset: the GM test measures were systematically greater, (4.6 ± 1.3)%, and the WM test measures systematically smaller, (-2.8 ± 1.2)%, than the corresponding retest measures. The comparison between the volumes provided by SPM and those provided by FS showed that FS overestimated the volumes with respect to SPM, except for the GM volume. These findings were consistent across the Kirby and OASIS data samples, as shown in Table 1. Figure 1 shows the overlays of the ROI masks obtained by SPM and FS for the worst cases of the Kirby and OASIS data samples. Dice values were in the 0.76-0.83 range. In the male vs. female brain volume comparisons, inconsistencies arose for the OASIS dataset, where the gender-related differences appear subtler than in the Kirby dataset. In particular, gender-related differences on the Kirby dataset were consistently detected in the volumes segmented by both SPM and FS (Cohen’s effect size > 1.1), whereas on the OASIS dataset significant volume differences were not consistently detected by the two methods, and the Cohen’s effect sizes were generally lower (see Table 2).
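The effect sizes quoted above can be illustrated with a short Python sketch of Cohen's d and the associated two-sample t statistic. This is a sketch under stated assumptions rather than the authors' procedure: it assumes the pooled-standard-deviation form of Cohen's d and an equal-variance Student's t test, and the function names and the volumes in the example are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed form)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd


def student_t(group_a, group_b):
    """Two-sample Student's t statistic, equal variances assumed."""
    na, nb = len(group_a), len(group_b)
    # With a pooled SD, t is Cohen's d rescaled by the group sizes.
    return cohens_d(group_a, group_b) / math.sqrt(1.0 / na + 1.0 / nb)


# Hypothetical volumes in ml for two subgroups (illustrative values only).
males = [4.0, 4.2, 4.4]
females = [3.6, 3.8, 4.0]

print(f"d = {cohens_d(males, females):.2f}, t = {student_t(males, females):.2f}")
```

The p-value is omitted here because the Python standard library has no t distribution; a library routine such as `scipy.stats.ttest_ind` could supply it. Computing d separately for the SPM and FS volumes of the same subjects is one way to see whether the two methods agree on the size of a group difference.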


SPM seems to be more robust than FS to any variation that may have occurred between the test and retest scans of the OASIS data sample. The inter-method reproducibility analysis revealed discrepancies between the volumes calculated by SPM and FS, visible both in the Bland-Altman plots and in the Dice indices. These differences can be due to the different implementations of the segmentation algorithms, including the adoption of different reference atlases. Additional discrepancies in segmenting subcortical structures can be due to the arbitrariness in defining their boundaries with respect to surrounding structures.
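For reference, the Dice index used to quantify the overlap between two segmentations of the same ROI can be sketched in a few lines of Python. The sketch assumes that both masks have already been binarized and resampled onto the same voxel grid, and that they are passed as flattened 0/1 sequences; the function name and the toy masks are hypothetical.

```python
def dice_index(mask_a, mask_b):
    """Dice similarity index of two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    if size_a + size_b == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / (size_a + size_b)


# Toy one-dimensional "masks" (illustrative values only).
mask_spm = [0, 1, 1, 1, 0, 0]
mask_fs = [0, 0, 1, 1, 1, 0]

print(f"Dice = {dice_index(mask_spm, mask_fs):.3f}")
```

The index is 1 for identical masks and 0 for disjoint ones. Because it is normalized by the sum of the two mask sizes, the same number of mismatched boundary voxels lowers the index more for a small structure than for a large one.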


We support SPM as the more consistent tool for evaluating ROI volumes. In any case, as the two methods rely on different algorithm pipelines, which can be differently affected by the presence of abnormalities, image artifacts, or variations in the acquisition protocol parameters, we suggest cross-validating the findings of each research study against different segmentation methods before proceeding to their interpretation.


The OASIS project was funded by grants P50 AG05681, P01 AG03991, R01 AG021910, P50 MH071616, U24 RR021382, R01 MH56584. This work has been partially funded by the Tuscany Government (Bando FAS Salute by Sviluppo Toscana, ARIANNA Project), and by the National Institute of Nuclear Physics (nextMR project). Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


1. SPM. Available at https://www.fil.ion.ucl.ac.uk/spm/, 2018.

2. FreeSurfer [online]. Available at https://surfer.nmr.mgh.harvard.edu/. Accessed 15 June 2011.

3. Landman BA, Huang AJ, Gifford A, Vikram DS, Lim I, Farelli J, Smith S: Multi-parametric neuroimaging reproducibility: a 3-T resource study. Neuroimage 54: 2854–2866, 2011.

4. Marcus DS, Wang TH, et al.: Open Access Series of Imaging Studies (OASIS): Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults. Journal of Cognitive Neuroscience 19(9): 1498–1507, 2007.


Table 1. Inter-method reproducibility measurements between SPM and FS evaluated on the Kirby and OASIS data samples: Bland-Altman parameters, i.e. the mean (d) and standard deviation (s) of the percent difference and the limits of agreement corresponding to the 95% C.I. in volume difference. The percent difference is computed as the difference between the SPM and FS volumes divided by their average.

Figure 1. Overlay of the ROIs segmented by SPM and by FS onto a single-subject anatomical image in native space for the worst case of the Kirby-21 data sample (left) and of the OASIS data sample (right). GM and WM are shown in the first row; hippocampus, putamen and caudate in the second row; brainstem in the third row.

Table 2. Gender differences in the volumes [ml] of the brain structures of interest for SPM and FS, evaluated on the Kirby and OASIS datasets: mean, standard deviation (SD) and statistical measures (t, p-value and Cohen’s d).

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)