Arjun D. Desai^{1}, Garry E. Gold^{1,2,3}, Brian A. Hargreaves^{1,2,4}, and Akshay S. Chaudhari^{1}

Deep convolutional neural networks (CNNs) have shown promise in challenging tissue segmentation problems in medical imaging. However, due to the large size of these networks and the stochasticity of the training process, the factors affecting CNN performance are difficult to model analytically. In this study, we numerically evaluate the impact of network architecture and training-data characteristics on network performance for segmenting femoral cartilage. We show that extensive training of several common network architectures yields comparable performance and that near-optimal network generalizability can be achieved with limited training data.

Semantic segmentation is a critical task for localizing anatomy and identifying pathology in MR images. However, manual segmentation of tissue structures is a cumbersome process prone to inter-reader variations^{1}. Recent advances in deep learning and convolutional neural networks (CNNs) may decrease segmentation time and eliminate inter-reader variability^{2,3,4}. However, because CNNs are difficult to model analytically, evaluating the generalizability of these networks is challenging.

In this study, we segmented femoral articular cartilage as an archetype of tissues that are challenging to segment: cartilage is a thin, curved, volumetric structure with limited imaging contrast. We systematically evaluated how variations in network architecture and training data impact the generalizability of CNNs for segmentation of femoral cartilage in knee MRI.

*Dataset*

3D sagittal double-echo in steady-state (DESS) volumes and corresponding femoral cartilage segmentation masks were acquired from the iMorphics dataset^{5}. A total of 88 patients, each scanned at two time points one year apart, were randomly split into disjoint cohorts of 60 patients for training, 14 for validation, and 14 for testing, resulting in 120 training, 28 validation, and 28 testing volumes. An approximately equal distribution of Kellgren-Lawrence (KL) grades was maintained among the three groups. During training, network weights were initialized using He initialization^{6}.
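As a concrete illustration, the sketch below shows one way to perform a patient-level split that approximately balances KL grades across cohorts. This is a minimal sketch with hypothetical function and variable names, not the study's actual splitting code; the proportional-rounding scheme is an assumption.

```python
# Minimal sketch (not the study's code) of a patient-level split that
# approximately balances KL grades across train/val/test cohorts.
import random
from collections import defaultdict

def stratified_patient_split(patient_ids, kl_grades, sizes=(60, 14, 14), seed=0):
    """Split patients (not scans) into train/val/test with a similar KL mix."""
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for pid, kl in zip(patient_ids, kl_grades):
        by_grade[kl].append(pid)

    n_total = len(patient_ids)
    train, val, test = [], [], []
    for pids in by_grade.values():
        rng.shuffle(pids)
        # Allocate each KL grade proportionally to the 60/14/14 cohort sizes;
        # rounding may leave a cohort one patient off its exact target.
        k_tr = round(len(pids) * sizes[0] / n_total)
        k_va = round(len(pids) * sizes[1] / n_total)
        train += pids[:k_tr]
        val += pids[k_tr:k_tr + k_va]
        test += pids[k_tr + k_va:]
    return train, val, test
```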

*Base Network Architecture*

We utilized three general 2D CNN architectures that use variations of the encoder-decoder framework: U-Net^{7}, SegNet^{8}, and DeeplabV3+^{9} (Fig.1a-c). All architectures were trained from scratch (20 epochs) and subsequently fine-tuned (20 epochs), each with empirically optimized hyperparameters (Fig.1d).
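To make the encoder-decoder framework concrete, the following PyTorch sketch shows a single-level, U-Net-style network with one skip concatenation. It only illustrates the scheme in Fig. 1a; the actual experiments use the full published U-Net, SegNet, and DeeplabV3+ designs.

```python
# Toy single-level encoder-decoder with one U-Net-style skip connection,
# illustrating the concatenation scheme in Fig. 1a (not the full networks).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = conv_block(1, 32)           # assumes a single-channel input slice
        self.pool = nn.MaxPool2d(2)
        self.mid = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)          # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, 1, 1)        # single-channel cartilage mask

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.pool(e))
        d = self.dec(torch.cat([self.up(m), e], dim=1))  # skip concatenation
        return torch.sigmoid(self.head(d))
```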

*Volumetric Architecture*

To evaluate whether 3D inputs improve spatial localization of tissues, we developed two renditions of the 2D U-Net architecture: the 2.5D and 3D U-Nets. The 2.5D U-Net uses an N-to-1 mapping, taking N consecutive input slices and generating a 2D segmentation for the central slice only. The 3D U-Net applies an N-to-N mapping, segmenting all input slices. Three versions of the 2.5D U-Net were evaluated (N=3,5,7). The 3D U-Net was trained on 32 consecutive slices (N=32) to maximize the through-plane receptive field.
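The N-to-1 mapping can be implemented by stacking the N neighboring slices as input channels, as in this hypothetical sketch (edge slices are clamped here, which is one of several plausible boundary conventions):

```python
# Hypothetical construction of a 2.5D training sample: N consecutive slices
# stacked as channels, with the label taken from the central slice only.
import numpy as np

def make_2p5d_sample(volume, masks, center, n=5):
    """volume: (S, H, W) image stack; masks: (S, H, W) labels; n must be odd."""
    half = n // 2
    idx = np.clip(np.arange(center - half, center + half + 1),
                  0, volume.shape[0] - 1)    # clamp indices at volume edges
    x = volume[idx]                          # (n, H, W) -> n input channels
    y = masks[center]                        # target for the central slice only
    return x, y
```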

*Loss Function*

We evaluated three loss functions, popular in semantic segmentation for their robustness to class imbalance, for training instances of the 2D U-Net: Dice loss, binary cross-entropy (BCE), and weighted cross-entropy (WCE). Class weights for WCE were set empirically to the inverse of the relative frequencies of background and cartilage pixels in the training set.
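For reference, below are common formulations of the soft Dice loss and a class-weighted binary cross-entropy consistent with the description above; the exact weight values depend on the training-set pixel frequencies and are not reproduced here.

```python
# Common formulations of the soft Dice loss and class-weighted BCE (binary
# form of WCE); weight values here are placeholders, not the study's values.
import torch

def soft_dice_loss(pred, target, eps=1e-6):
    """pred: sigmoid probabilities; target: binary mask; both (B, 1, H, W)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def weighted_bce(pred, target, w_fg, w_bg):
    """Per-pixel BCE weighted by inverse class frequency (w_fg >> w_bg)."""
    loss = -(w_fg * target * torch.log(pred + 1e-7)
             + w_bg * (1 - target) * torch.log(1 - pred + 1e-7))
    return loss.mean()
```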

*Data Augmentation*

We performed four-fold, physiologically plausible data augmentation on 2D slices using random scaling, shear, contrast adjustment, and motion blurring to compare 2D U-Net performance on augmented versus non-augmented training data. To equalize the number of backpropagation steps, the non-augmented network was trained 5x longer (100 epochs) than the augmented network (20 epochs).
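A rough sketch of the four augmentation types is shown below using scipy.ndimage. The parameter ranges and blur kernel are illustrative assumptions, not the study's values, and the image is assumed to be normalized to [0, 1].

```python
# Illustrative sketch of the four augmentations (scaling, shear, contrast,
# motion blur); ranges are assumptions, and img is assumed scaled to [0, 1].
import numpy as np
from scipy import ndimage

def augment_slice(img, mask, rng):
    s = rng.uniform(0.9, 1.1)                        # random isotropic scaling
    shear = rng.uniform(-0.1, 0.1)                   # random shear
    affine = np.array([[s, shear], [0.0, s]])
    img = ndimage.affine_transform(img, affine, order=1)
    mask = ndimage.affine_transform(mask, affine, order=0)  # nearest for labels
    img = img ** rng.uniform(0.8, 1.2)               # gamma-style contrast change
    if rng.random() < 0.5:
        k = np.zeros((5, 5)); k[2, :] = 1.0 / 5      # horizontal motion-blur kernel
        img = ndimage.convolve(img, k)
    return img, mask
```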

*Data Limitation*

We trained the three 2D network architectures on varying extents of subsampled training data. The original training set of 60 patients was randomly sampled (with replacement) to create four sub-training sets of 5, 15, 30, and 60 patients with similar distributions of KL grades. These networks were trained for 240, 80, 40, and 20 epochs, respectively, so that all networks performed an equal number of backpropagation steps.
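The epoch counts follow directly from holding the patient-epoch budget constant:

```python
# Epochs per subsampled cohort, chosen so every network sees the same number
# of backpropagation steps (patients x epochs = constant).
budget = 60 * 20  # 60 training patients x 20 epochs = 1200 patient-epochs
for n_patients in (5, 15, 30, 60):
    print(n_patients, "patients ->", budget // n_patients, "epochs")
# 5 -> 240, 15 -> 80, 30 -> 40, 60 -> 20
```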

*Statistical Analysis*

Dice score coefficients (DSC) and volumetric overlap errors (VOE) were used to quantify segmentation accuracy with reference to manual segmentations. Kruskal-Wallis tests with corresponding Dunn post-hoc tests (α=0.05) were used to assess the significance of differences between network instances.
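For clarity, both overlap metrics reduce to simple set operations on binary masks, as in the sketch below (the Kruskal-Wallis test itself is available, e.g., as scipy.stats.kruskal):

```python
# DSC = 2|A∩B| / (|A| + |B|) and VOE = 1 - |A∩B| / |A∪B| on binary masks.
import numpy as np

def dsc(pred, truth):
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def voe(pred, truth):
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return 1.0 - inter / union
```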

[1] Eckstein F, Kwoh CK, Link TM, OAI Investigators. Imaging research results from the Osteoarthritis Initiative (OAI): a review and lessons learned 10 years after start of enrolment. Annals of the Rheumatic Diseases 2014;annrheumdis-2014-205310.

[2] Avendi M, Kheradvar A, Jafarkhani H. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis 2016;30:108-119.

[3] Liu F, Zhou Z, Jang H, Samsonov A, Zhao G, Kijowski R. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magnetic Resonance in Medicine 2018;79(4):2379-2391.

[4] Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Transactions on Medical Imaging 2016;35(5):1240-1251.

[5] Peterfy C, Schneider E, Nevitt M. The osteoarthritis initiative: report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthritis and Cartilage 2008;16(12):1433-1441.

[6] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2015. p 1026-1034.

[7] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2015. Springer. p 234-241.

[8] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561; 2015.

[9] Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587; 2017.

[10] Hestness J, Narang S, Ardalani N, Diamos G, Jun H, Kianinejad H, Patwary M, Ali M, Yang Y, Zhou Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409; 2017.

Fig. 1. 2D base deep convolutional neural network architectures that use an encoder-decoder framework for semantic segmentation. U-Net (A) uses skip connections to concatenate encoder feature maps with decoder feature maps, while SegNet (B) passes pooling indices to the decoder to reduce the computational complexity of feature-map concatenation. DeeplabV3+ (C) uses spatial pyramid pooling to extract latent feature vectors at multiple fields of view. The parameters used for training each model (D) were selected empirically for each network. The largest mini-batch size that could fit on the Titan Xp GPU was used for each network.

Fig. 2. Table detailing mean (and standard deviation) segmentation performance with dice score coefficient (DSC) and volumetric overlap error (VOE) for comparative experiments. Models with significantly superior performance (p<0.01) among *all* models in the same experiment category are bolded.

Fig. 3. Performance of U-Net, SegNet, and DeeplabV3+ (DLV3+) when trained on retrospectively subsampled training data. The plots (log-x scale) and corresponding R^{2} values indicate a power-law relationship between segmentation performance, as measured by the dice score coefficient (DSC) and volumetric overlap error (VOE), and the number of training patients for all networks. These results suggest an empirical maximum performance limit for networks given a fixed training time (# epochs). Experiments were repeated 3 times with fixed Python random seeds to ensure reproducibility of the results.
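Assuming the functional form DSC ≈ a·n^b implied by the caption, the power-law fit can be reproduced by linear regression in log-log space:

```python
# Sketch of a power-law fit: scores ≈ a * n^b is linear in log-log space,
# so ordinary least squares via np.polyfit recovers the exponent b.
import numpy as np

def fit_power_law(n_patients, scores):
    b, log_a = np.polyfit(np.log(n_patients), np.log(scores), 1)
    return np.exp(log_a), b  # scores ≈ a * n_patients ** b
```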

Fig. 4. Dice score coefficients (DSC) corresponding to differences in network architectures (A), U-Net volumetric architectures (B), training loss functions (C), and data augmentation (D), plotted against the normalized (0-100%) field-of-view (FOV) in the volume. 0% corresponds to the medial compartment and 100% to the lateral compartment. FOV was calculated per test scan as the region between the initial and final slices in which femoral cartilage appeared in the ground-truth manual segmentation. Because patient knee sizes vary, the normalized FOV maps segmentation accuracy to general regions in the knee. Segmentation accuracy decreases at edge slices (FOV <7.5% or >92.5%) and in medial-lateral transition regions (FOV: 50-65%).

Fig. 5. Sample segmentations from the three FCN architectures (U-Net, SegNet, DeeplabV3+) with true-positive (green), false-positive (blue), and false-negative (red) overlays. Despite the statistically significant difference between the performance of U-Net and the other two architectures, there is minimal visual variation between network outputs. Failures occur in regions of thin, disjoint femoral cartilage, common in edge (A) and medial-lateral transition (C) slices. However, thick, continuous cartilaginous regions (B) show considerably better performance throughout the entire region, including edge pixels. All networks successfully handled challenging slices (C) that include cartilage lesions, heterogeneous signal, and proximity to anatomy with similar signal (ACL, fluid).