Technical Considerations for Semantic Segmentation in Magnetic Resonance Imaging using Deep Convolutional Neural Networks: A Case Study in Femoral Cartilage Segmentation
Arjun D. Desai1, Garry E. Gold1,2,3, Brian A. Hargreaves1,2,4, and Akshay S. Chaudhari1

1Radiology, Stanford University, Stanford, CA, United States, 2Bioengineering, Stanford University, Stanford, CA, United States, 3Orthopedic Surgery, Stanford University, Stanford, CA, United States, 4Electrical Engineering, Stanford University, Stanford, CA, United States


Synopsis

Deep convolutional neural networks (CNNs) have shown promise for challenging tissue segmentation problems in medical imaging. However, due to the large size of these networks and the stochasticity of the training process, the factors affecting CNN performance are difficult to model analytically. In this study, we numerically evaluate the impact of network architecture and training-data characteristics on network performance for segmenting femoral cartilage. We show that extensive training of several common network architectures yields comparable performance, and that near-optimal network generalizability can be achieved with limited training data.


Introduction

Semantic segmentation is a critical task for localizing anatomy and identifying pathology in MR images. However, manual segmentation of tissue structures is a cumbersome process prone to inter-reader variation1. Recent advances in deep learning and convolutional neural networks (CNNs) may decrease segmentation time and eliminate inter-reader variability2,3,4. However, because CNNs are difficult to model analytically, evaluating the generalizability of these networks remains challenging.

In this study, we segmented femoral articular cartilage as an archetype for tissues that are challenging to segment: cartilage has a thin, curved, volumetric structure and limited imaging contrast. We systematically evaluated how variations in network architecture and training data impact the generalizability of CNNs for segmentation of femoral cartilage in knee MRI.



Methods

3D sagittal double-echo in steady-state (DESS) volumes and corresponding femoral cartilage segmentation masks were acquired from the iMorphics dataset5. A total of 88 patients, each scanned at two time points one year apart, were randomly split into disjoint cohorts of 60 patients for training, 14 for validation, and 14 for testing, yielding 120 (training), 28 (validation), and 28 (testing) volumes. An approximately equal distribution of Kellgren-Lawrence (KL) grades was maintained among all three groups. During training, network weights were initialized using He initialization6.
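A patient-level split as described above might be sketched as follows. This is a minimal illustration, not the study's code: the KL-grade stratification is omitted here for brevity, and the seeded shuffle is an assumed mechanism for reproducibility.

```python
import random

def split_patients(patient_ids, n_train=60, n_val=14, n_test=14, seed=0):
    """Patient-level split: both timepoints of a patient stay in one cohort,
    so no subject appears in more than one of train/val/test."""
    assert len(patient_ids) == n_train + n_val + n_test
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumed)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

patients = [f"P{i:03d}" for i in range(88)]  # hypothetical patient IDs
train, val, test = split_patients(patients)
# Each patient contributes 2 volumes (baseline + 1-year follow-up).
n_volumes = {"train": 2 * len(train), "val": 2 * len(val), "test": 2 * len(test)}
```

Splitting by patient rather than by volume keeps the two scans of each subject in the same cohort, which is what makes the cohorts truly disjoint.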

Base Network Architecture

We utilized three general 2D CNN architectures built on variations of the encoder-decoder framework: U-Net7, SegNet8, and DeeplabV3+9 (Fig.1a-c). All architectures were trained from scratch (20 epochs) and subsequently fine-tuned (20 epochs), each with empirically optimized hyperparameters (Fig.1d).

Volumetric Architecture

To evaluate whether 3D inputs improve spatial localization of tissues, we developed two variants of the 2D U-Net architecture: the 2.5D and 3D U-Nets. The 2.5D U-Net uses an N-to-1 mapping, taking N consecutive input slices and generating a 2D segmentation of the central slice. The 3D U-Net applies an N-to-N mapping. Three versions of the 2.5D U-Net were evaluated (N=3,5,7). The 3D U-Net was trained on 32 consecutive slices (N=32) to maximize the through-plane receptive field.
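The N-to-1 input construction for the 2.5D network can be sketched as below. The reflect-padding at the volume edges is one plausible choice and an assumption on our part; the abstract does not specify how edge slices were handled.

```python
import numpy as np

def make_25d_inputs(volume, n=5):
    """Build N-to-1 inputs for a 2.5D network: each sample stacks n
    consecutive slices, and the target is the mask of the central slice.
    Edge slices are reflect-padded along the slice axis (assumed)."""
    assert n % 2 == 1, "n must be odd so a central slice exists"
    half = n // 2
    padded = np.pad(volume, ((half, half), (0, 0), (0, 0)), mode="reflect")
    # One stack of n slices per original slice position.
    stacks = np.stack([padded[i:i + n] for i in range(volume.shape[0])])
    return stacks  # shape: (num_slices, n, H, W)

vol = np.random.rand(32, 64, 64)  # e.g. 32 sagittal slices
x = make_25d_inputs(vol, n=5)
```

By construction, the central channel of each stack is the slice being segmented, so the surrounding slices act purely as through-plane context.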

Loss Function

We evaluated three training loss functions, each popular in semantic segmentation for its robustness to class imbalance, for training instances of the 2D U-Net: dice loss, binary cross-entropy (BCE), and weighted cross-entropy (WCE). Class weights for WCE were set empirically to the inverse of the relative frequencies of background and cartilage pixels in the training set.

Data Augmentation

We performed four-fold, physiologically plausible data augmentation of 2D slices using random scaling, shear, contrast adjustment, and motion blurring, and compared 2D U-Net performance on augmented versus non-augmented training data. To equalize the number of backpropagation steps, the non-augmented network was trained 5x longer (100 epochs) than the augmented network.
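Two of the four augmentations (contrast adjustment and motion blur) can be sketched dependency-free as below; the parameter ranges are illustrative assumptions, and random scaling and shear, which need interpolation, would typically use an image library such as scipy.ndimage.

```python
import numpy as np

def augment_slice(img, rng):
    """One random augmentation of a 2D slice in [0, 1]: linear contrast
    adjustment followed by a horizontal motion blur. Parameter ranges
    are illustrative, not the study's values."""
    # Contrast: scale intensities about the mean by a random factor.
    alpha = rng.uniform(0.9, 1.1)
    out = np.clip((img - img.mean()) * alpha + img.mean(), 0.0, 1.0)
    # Motion blur: average each row over a short horizontal window.
    k = rng.choice([1, 3, 5])
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, out)
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64))
aug = augment_slice(img, rng)
```

Keeping the transforms physiologically plausible (small contrast shifts, mild blur) avoids teaching the network to segment anatomy it would never see in practice.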

Data Limitation

We trained the three 2D network architectures on varying extents of subsampled training data. The original training set of 60 patients was randomly sampled (with replacement) to create four sub-training sets of 5, 15, 30, and 60 patients with similar distributions of KL grades. These networks were trained for 240, 80, 40, and 20 epochs, respectively, so that all networks performed an equal number of backpropagation steps.
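The epoch schedule above follows directly from holding the total number of backpropagation steps constant: epochs scale inversely with training-set size. As a small sanity check:

```python
def epochs_for_equal_steps(n_patients, base_patients=60, base_epochs=20):
    """Scale epochs inversely with training-set size so every network
    sees the same total number of backpropagation steps."""
    return base_epochs * base_patients // n_patients

# Reproduces the 240/80/40/20 schedule from the text.
schedule = {n: epochs_for_equal_steps(n) for n in (5, 15, 30, 60)}
```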

Statistical Analysis

Dice score coefficients (DSC) and volumetric overlap errors (VOE) were used to quantify segmentation accuracy with reference to manual segmentations. Kruskal-Wallis tests with corresponding Dunn post-hoc tests (α=0.05) were used to assess statistical significance between network instances.
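Both metrics have standard definitions on binary masks, sketched below: DSC is twice the intersection over the sum of mask sizes, and VOE is the complement of the Jaccard index.

```python
import numpy as np

def dsc(pred, gt):
    """Dice score coefficient between binary masks: 2|A∩B| / (|A|+|B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def voe(pred, gt):
    """Volumetric overlap error: 1 - |A∩B| / |A∪B| (Jaccard complement)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 - inter / union
```

A perfect segmentation gives DSC = 1 and VOE = 0; the two metrics are monotonically related, so they rank models identically but differ in sensitivity to partial overlap.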


Results

The 2D U-Net significantly outperformed SegNet as well as the 2.5D and 3D volumetric U-Net models (p<0.01). The U-Net trained with dice loss achieved better DSC and VOE than its WCE (p<0.01) and BCE counterparts. Data augmentation produced a significant (p<0.01) improvement in segmentation. Comprehensive results are shown in Fig. 2. Retrospective subsampling of training data showed a power-law trend between the number of training patients and segmentation accuracy (Fig.3). To map segmentation accuracy to anatomical regions in patients with variable knee sizes, each slice was normalized to the field-of-view (Fig.4).
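A power-law trend of this kind is typically fit by linear regression in log-log space, as sketched below. The numerical values here are purely illustrative, not the study's measurements.

```python
import numpy as np

def fit_power_law(n_patients, scores):
    """Fit scores ≈ a * n^b by linear regression in log-log space;
    return (a, b) and the R² of the log-log fit."""
    x, y = np.log(n_patients), np.log(scores)
    b, log_a = np.polyfit(x, y, 1)  # slope = exponent, intercept = log(a)
    resid = y - (b * x + log_a)
    r2 = 1.0 - resid.var() / y.var()
    return np.exp(log_a), b, r2

# Illustrative values only (not the study's data):
n = np.array([5, 15, 30, 60])
voe_vals = 0.6 * n ** -0.25
a, b, r2 = fit_power_law(n, voe_vals)
```

An R² near 1 on the log-log fit is what justifies calling the trend a power law; the fitted exponent then sets the asymptotic rate at which more training patients buy additional accuracy.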


Discussion

Models of all base architectures showed low precision for cartilage in edge and medial-lateral transition slices, while maintaining high fidelity in central slices (Fig.5). Although class weighting is commonly used, the model trained with WCE loss classified non-cartilage regions with low accuracy, suggesting that weighted losses favor false positives over false negatives. The power-law relationship between training-set size and segmentation accuracy suggests an asymptotic limit on network performance, corroborating previous studies on non-medical images10.


Conclusion

In this study, we demonstrate the tradeoffs in optimizing CNN performance through variations in network architecture and training-data characteristics. While robust networks can be trained effectively with limited data, these results also suggest an inherent performance limit for CNNs.


Acknowledgements

Research support provided by NIH AR0063643, NIH EB002524, NIH AR062068, NIH EB017739, NIH EB015891, and Philips.


References

[1] Eckstein F, Kwoh CK, Link TM, OAI Investigators. Imaging research results from the Osteoarthritis Initiative (OAI): a review and lessons learned 10 years after start of enrolment. Annals of the Rheumatic Diseases 2014:annrheumdis-2014-205310.

[2] Avendi M, Kheradvar A, Jafarkhani H. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical image analysis 2016;30:108-119.

[3] Liu F, Zhou Z, Jang H, Samsonov A, Zhao G, Kijowski R. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magnetic resonance in medicine 2018;79(4):2379-2391.

[4] Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE transactions on medical imaging 2016;35(5):1240-1251.

[5] Peterfy C, Schneider E, Nevitt M. The osteoarthritis initiative: report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthritis and cartilage 2008;16(12):1433-1441.

[6] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2015. p 1026-1034.

[7] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015. Springer. p 234-241.

[8] Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.

[9] Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[10] Hestness J, Narang S, Ardalani N, Diamos G, Jun H, Kianinejad H, Patwary M, Ali M, Yang Y, Zhou Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.


Fig. 1. 2D base deep convolutional neural network architectures that use an encoder-decoder framework for semantic segmentation. U-Net (A) uses skip connections to concatenate feature maps from the encoder to the decoder, while SegNet (B) passes pooling indices to the decoder to reduce the computational complexity of feature concatenation. DeeplabV3+ (C) uses spatial pyramid pooling to extract latent feature vectors at multiple fields of view. The parameters used for training each model (D) were selected empirically for each network. The largest mini-batch size that could be loaded on the Titan Xp GPU was used for each network.

Fig. 2. Table detailing mean (and standard deviation) segmentation performance with dice score coefficient (DSC) and volumetric overlap error (VOE) for comparative experiments. Models with significantly superior performance (p<0.01) among all models in the same experiment category are bolded.

Fig. 3. Performance of U-Net, SegNet, and DeeplabV3+ (DLV3+) when trained on retrospectively subsampled training data. The plots (log-x scale) and corresponding R2 values indicate a power-law relationship between segmentation performance, as measured by the dice score coefficient (DSC) and volumetric overlap error (VOE), and the number of training patients for all networks. These results suggest an empirical maximum performance limit for networks given a fixed training time (# epochs). Experiments were repeated 3 times with fixed Python random seeds to ensure reproducibility of the results.

Fig. 4. The dice score coefficient (DSC) corresponding to differences in network architectures (A), U-Net volumetric architectures (B), training loss functions (C), and data augmentation (D), plotted against the normalized (0-100%) field-of-view (FOV) of the volume. 0% corresponds to the medial compartment and 100% to the lateral. FOV was calculated per test scan as the region between the first and last slices in which femoral cartilage appeared in the ground-truth manual segmentation. Because patient knee sizes vary, the normalized FOV maps segmentation accuracy to general regions of the knee. Segmentation accuracy decreases at edge slices (FOV <7.5% or >92.5%) and in medial-lateral transition regions (FOV 50-65%).

Fig. 5. Sample segmentations from the three FCN architectures (U-Net, SegNet, DeeplabV3+) with true-positive (green), false-positive (blue), and false-negative (red) overlays. Despite the statistically significant difference between the performance of U-Net and the other two architectures, there is minimal visual variation among network outputs. Failures occur in regions of thin, disjoint femoral cartilage common in edge (A) and medial-lateral transition (C) slices. However, thick, continuous cartilaginous regions (B) show considerably better performance throughout, including at edge pixels. (C) shows that all networks successfully handled challenging slices containing cartilage lesions, heterogeneous signal, and proximity to anatomy with similar signal (ACL, fluid).

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)