Repeatability of radiomics features in double baseline MR imaging of glioblastoma
Katharina V Hoebel1,2, Andrew L Beers1, James M Brown1, Ken Chang1,2, Jay B Patel1,2, Marco C Pinho1, Bruce R Rosen1, Tracy T Batchelor3, Elizabeth R Gerstner3, and Jayashree Kalpathy-Cramer1

1Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, United States, 2Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, United States, 3Stephen E. and Catherine Pappas Center for Neuro-Oncology, Massachusetts General Hospital, Boston, MA, United States


Radiomic feature extraction must be repeatable to be clinically useful. We investigated the repeatability of radiomic feature extraction on a unique dataset of double-baseline MRI scans from 48 patients diagnosed with glioblastoma. Size and shape features, which are mostly governed by tumor segmentation, showed on average higher repeatability than intensity and texture-based features, which are more dependent on image acquisition and preprocessing. More research on the influence of image acquisition and preprocessing on the repeatability and reliability of radiomic features is needed to make radiomics a safe image-analysis tool.


Radiomics is a method to quantify localized pathologies in medical images, such as tumors, and determine their ‘phenotypic characteristics’. These characteristics take the form of quantitative descriptors extracted from voxelwise regions of interest delineating the pathological tissue.1 Although features are assumed to reliably capture biological properties like tumor heterogeneity, they are mostly based on the relationship between voxel intensities and the shape of the manually or (semi-)automatically determined region of interest. As many diseases are monitored longitudinally, only repeatable and reproducible radiomic features should be used for treatment planning.2 Given the existence of several potential sources of visit-to-visit variability within typical radiomics pipelines, we aimed to assess the repeatability of radiomic feature extraction using a unique double-baseline magnetic resonance imaging (MRI) dataset acquired from two clinical trials.


Patients: We evaluated imaging from 48 patients from two clinical trials at our institutions (NCT00756106, NCT00662506). The patients were newly diagnosed with glioblastoma and underwent two baseline scans 3 to 5 days apart (mean 3.7 days) prior to the start of treatment (double baseline). No tumor showed significant progression between the two baseline scans as measured by change in contrast-enhancing tumor volume (T1W post-contrast) or T2-weighted fluid-attenuated inversion recovery (T2W-FLAIR) hyperintensity.3

Image Acquisition and Feature Extraction: Scans were acquired using identical imaging protocols on a 3.0-T MRI system (TimTrio; Siemens Medical Solutions, Malvern, PA). Reproducible slice positioning was ensured using AutoAlign. For further analysis, we used T2W-FLAIR and contrast-enhanced T1W (T1W post-contrast) sequences (5 mm slice thickness, 1 mm interslice gap, 0.43 mm in-plane resolution for both sequences). Radiomics features were extracted using two independent, open-source Python packages: pyradiomics4 and qtim_tools5. Images were skull-stripped, and each package's default image normalization was applied as part of the feature extraction process. Feature calculation was based on manual segmentations by expert raters of the whole tumor on T2W-FLAIR and the enhancing tumor on T1W post-contrast sequences.

Analysis: Features were grouped into size, shape, intensity, and texture features.6 The intraclass correlation coefficient (ICC) between feature values of the first and second baseline visits was determined using R (package: irr; two-way model, type: consistency, unit: single, confidence level: 0.95).
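As an illustration of the ICC variant named above (two-way model, consistency, single unit), the following is a minimal sketch in plain Python using the standard mean-squares formula, ICC = (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error). It is not the code used in the study, which relied on the R package; the function name `icc_consistency_single` is hypothetical, and the confidence interval is not computed here.

```python
def icc_consistency_single(ratings):
    """Two-way, consistency, single-measurement ICC.

    ratings: list of rows, one row per patient; each row holds the value
    of one radiomic feature at each visit (k = 2 for a double baseline).
    Assumes the feature is not constant across all patients and visits.
    """
    n = len(ratings)       # number of patients
    k = len(ratings[0])    # number of visits per patient

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
```

Because the consistency type removes the between-visit (column) effect, a feature that shifts by the same constant for every patient between visits still yields an ICC of 1; only patient-specific changes lower the value.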


For each sequence and visit, we extracted 94 radiomics features using pyradiomics and 50 using qtim_tools (Table 1). Feature calculation was based on expert manual segmentations of the whole tumor region on the T2W-FLAIR image and the enhancing tumor on the T1W post-contrast image. Figure 1 illustrates the distribution of ICCs for both packages and both sequences (T2W-FLAIR and T1W post-contrast) from which radiomics features were extracted. Features are subdivided into groups based on similar characteristics, namely: size, shape, intensity, and texture. The size and shape feature groups showed on average the highest ICC values and the lowest variability. Intensity and texture features showed greater variation of ICC within the feature groups and on average lower ICC values than size and shape features. Since the texture feature class was quite diverse for pyradiomics, we evaluated three subcategories of texture features, based on the gray-level co-occurrence matrix (GLCM), gray-level size-zone matrix (GLSZM), and gray-level run-length matrix (GLRLM). Average ICC values for the individual texture subgroups are (T2W-FLAIR/T1W post-contrast): GLCM: 0.28/0.62, GLSZM: 0.41/0.44, GLRLM: 0.71/0.72.
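The per-subgroup averages reported above can be obtained by grouping per-feature ICC values by feature class. The sketch below assumes pyradiomics-style feature keys of the form `<imageType>_<featureClass>_<featureName>`; the function name `mean_icc_by_class` is hypothetical, and the ICC values in the example are placeholders for illustration only, not results from this study.

```python
from collections import defaultdict
from statistics import mean


def mean_icc_by_class(icc_by_feature):
    """Average ICC per feature class, parsed from pyradiomics-style keys."""
    groups = defaultdict(list)
    for name, icc in icc_by_feature.items():
        # Key layout assumed: "<imageType>_<featureClass>_<featureName>"
        feature_class = name.split("_", 2)[1]
        groups[feature_class].append(icc)
    return {cls: mean(values) for cls, values in groups.items()}


# Placeholder ICC values (NOT study results), keyed by feature name.
toy_iccs = {
    "original_shape_Sphericity": 0.9,
    "original_glcm_Contrast": 0.2,
    "original_glcm_Correlation": 0.4,
    "original_glrlm_RunEntropy": 0.7,
}
class_means = mean_icc_by_class(toy_iccs)
```

Mapping feature classes onto the size, shape, intensity, and texture groups used here would need an additional lookup table, since that grouping is defined per feature rather than per package class.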


In this study, we analyzed the repeatability of radiomics feature extraction for two open-source radiomics packages. Feature extraction was performed on a unique dataset of double-baseline scans of a patient cohort newly diagnosed with glioblastoma. The physicians performing manual segmentation confirmed that no significant change in tumor volume or shape could be detected between the two scans (average time interval 3.7 days, no treatment). Both scans for each patient were acquired on the same MRI machine with reproducible slice alignment, and manual segmentation of the tumor was performed by the same physician to limit variation.

Features in the size and shape groups, which are less influenced by image acquisition and preprocessing (such as normalization), showed good test-retest reliability, supporting the physicians’ findings. However, intensity and texture-based features showed great variability in their ICC. This result indicates that the measurement of some of the features in these groups cannot be reliably repeated.
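A toy example may make this distinction concrete. In the sketch below, a size feature is computed from the segmentation mask alone, whereas an intensity feature is computed from the voxel values inside the mask. The function names, the tiny 2D "image", and the multiplicative gain of 1.2 are all hypothetical; the gain is only a stand-in for visit-to-visit differences in acquisition or preprocessing, which in practice are more complex and may be partly removed by normalization.

```python
def voxel_count_volume(mask, voxel_volume_mm3=1.0):
    """Size feature: number of segmented voxels times the voxel volume."""
    return sum(sum(row) for row in mask) * voxel_volume_mm3


def mean_intensity(image, mask):
    """Intensity feature: mean voxel value inside the segmentation."""
    values = [
        value
        for image_row, mask_row in zip(image, mask)
        for value, inside in zip(image_row, mask_row)
        if inside
    ]
    return sum(values) / len(values)


# Same segmentation at both visits (hypothetical 2 x 3 slice).
mask = [[0, 1, 1],
        [0, 1, 0]]
visit1 = [[10.0, 100.0, 120.0],
          [12.0, 110.0, 9.0]]
# Hypothetical intensity gain between visits; the anatomy is unchanged.
visit2 = [[1.2 * value for value in row] for row in visit1]

volume_v1 = voxel_count_volume(mask)
volume_v2 = voxel_count_volume(mask)
mean_v1 = mean_intensity(visit1, mask)
mean_v2 = mean_intensity(visit2, mask)
```

With an identical mask, the volume is the same at both visits, while the mean intensity follows the change in voxel values even though the underlying tissue is unchanged.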


In this study on the repeatability of radiomics feature extraction in a unique double-baseline dataset exhibiting no biologically meaningful changes between visits, changes in extracted features appear to be caused by variations in image acquisition and preprocessing. These variations are challenging to control for, even in a study setting. Based on our findings, future research should address the causes of poor repeatability in radiomic feature extraction and develop more robust features and methods for quantitative image analysis, such as deep learning.


This publication was supported by the Martinos Scholars fund to Katharina Hoebel. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Martinos Scholars fund.

This project was supported by a training grant from the NIH Blueprint for Neuroscience Research (T90DA022759/R90DA023427) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under award number 5T32EB1680 to K. Chang and J. Patel. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This study was supported by National Institutes of Health grants U01 CA154601, U24 CA180927, and U24 CA180918 to J. Kalpathy-Cramer.

We would like to acknowledge the GPU computing resources provided by the MGH and BWH Center for Clinical Data Science.

This research was carried out in whole or in part at the Athinoula A. Martinos Center for Biomedical Imaging at the Massachusetts General Hospital, using resources provided by the Center for Functional Neuroimaging Technologies, P41EB015896, a P41 Biotechnology Resource Grant supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health.


1 Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., Van Stiphout, R. G. P. M., Granton, P., … Aerts, H. J. W. L. (2012). Radiomics: Extracting more information from medical images using advanced feature analysis. European Journal of Cancer, 48(4), 441–446. http://doi.org/10.1016/j.ejca.2011.11.036

2 Traverso, A., Wee, L., Dekker, A., & Gillies, R. (2018). Repeatability and Reproducibility of Radiomic Features: A Systematic Review. International Journal of Radiation Oncology*Biology*Physics, 102(4), 1143–1158. http://doi.org/10.1016/J.IJROBP.2018.05.053

3 Batchelor, T. T., Gerstner, E. R., Emblem, K. E., Duda, D. G., Kalpathy-Cramer, J., Snuderl, M., … Jain, R. K. (2013). Improved tumor oxygenation and survival in glioblastoma patients who show increased blood perfusion after cediranib and chemoradiation. Proceedings of the National Academy of Sciences, 110(47), 19059–19064. http://doi.org/10.1073/pnas.1318022110

4 Van Griethuysen, J. J. M., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., … Aerts, H. J. W. L. (2017). Computational radiomics system to decode the radiographic phenotype. Cancer Research, 77(21), e104–e107. http://doi.org/10.1158/0008-5472.CAN-17-0339

5 qtim_tools; https://github.com/QTIM-Lab/qtim_tools

6 Kalpathy-Cramer, J., Mamomov, A., Zhao, B., Lu, L., Cherezov, D., Napel, S., … Goldgof, D. (2016). Radiomics of Lung Nodules: A Multi-Institutional Study of Robustness and Agreement of Quantitative Imaging Features. Tomography, 2(4), 430–437. http://doi.org/10.18383/j.tom.2016.00235


Figure 1: Distribution of ICC values after grouping by feature category. A: pyradiomics on T2W-FLAIR, B: pyradiomics on T1W post-contrast, C: qtim_tools on T2W-FLAIR, D: qtim_tools on T1W post-contrast.

Table 1: Number of radiomic features listed by group and package.

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)