Jiacheng Jason He^{1}, Christopher Sandino^{1}, Shreyas Vasanawala^{1,2}, and Joseph Cheng^{2}

This work demonstrates the use of recurrent generative spatiotemporal autoencoders to predict up to fifteen future frames of abdominal DCE-MRI video data from only three ground-truth input frames of context. The objective is to predict what healthy patient video data and organ-specific contrast curves look like, in order to expedite anomaly detection and enable pulse-sequence optimization. The model in this study shows promise: it learned contrast changes without losing structural resolution during training, and it lays the foundation for future work.

Four-dimensional data from 100 patient volunteers are used: 70 patients for training, 10 for validation, and 20 for testing. Data were acquired using a T_{1}-weighted, 3D spoiled gradient recalled (SPGR) sequence on a GE 3T MR750 scanner with a 32-channel cardiac coil.^{3} Several preprocessing steps are taken: slices are zero-padded in the x- and y-directions so that all images have the same dimensions, and the data are normalized. For an initial proof of concept, only the image magnitude is used. The first three ground-truth frames are concatenated along the channel dimension to form the first input. At training time, additional ground-truth frames are used to predict each timestep; at validation time, predictions are fed back as inputs to make subsequent predictions.
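The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the padded dimensions (`target_h`, `target_w`) and max-normalization are assumptions, since the abstract does not specify the padded image size or the normalization scheme.

```python
import numpy as np

def preprocess_frame(frame, target_h=192, target_w=224):
    """Zero-pad a 2D magnitude image to a fixed size, then normalize.

    The target size and the max-scaling to [0, 1] are illustrative
    assumptions; the abstract only states that slices are zero-padded
    in x and y to a common size and that the data are normalized.
    """
    h, w = frame.shape
    pad_h, pad_w = target_h - h, target_w - w
    padded = np.pad(frame, ((pad_h // 2, pad_h - pad_h // 2),
                            (pad_w // 2, pad_w - pad_w // 2)))
    return padded / (padded.max() + 1e-8)

def build_context(frames):
    """Concatenate three consecutive frames along a new channel axis,
    producing an (H, W, 3) input for the encoder."""
    assert len(frames) == 3
    return np.stack(frames, axis=-1)
```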

The autoencoder architecture is shown in Figure 1. The objective of the encoder is to embed the key features of the inputs, from which the decoder produces the next time frame. The encoder is based on a VGG16 model without the fully-connected layers at the end.^{4} The decoder upsamples and deconvolves, with residual skip connections from the corresponding encoder layers to assist with learning structural similarities. The loss function is a weighted mean squared error (MSE), with higher weights on the earlier frame predictions. In TensorFlow, we use the Adam optimizer to minimize dependence on learning-rate tuning.
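The weighted MSE can be sketched as below. The exact weighting schedule is not given in the abstract, so the geometric decay here (`decay` parameter) is a hypothetical choice that satisfies the stated property of weighting earlier predictions more heavily.

```python
import numpy as np

def weighted_mse(preds, targets, decay=0.8):
    """Weighted MSE over a sequence of predicted frames.

    preds, targets: arrays of shape (T, H, W, C). Earlier frames receive
    higher weight via w_t = decay**t; `decay` is an assumed hyperparameter.
    """
    T = preds.shape[0]
    weights = decay ** np.arange(T)   # largest weight at t = 0
    weights /= weights.sum()          # normalize weights to sum to 1
    per_frame = ((preds - targets) ** 2).mean(axis=(1, 2, 3))
    return float((weights * per_frame).sum())
```

Because the weights are normalized, the loss stays on the same scale as an ordinary MSE regardless of sequence length.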

The three primary models are: a recurrent three-to-one (R3T1) model, a recurrent five-to-one (R5T1) model, and a non-recurrent three-to-one (N3T1) model. The R3T1 model is modelled after typical recurrent neural network (RNN) architectures, where the encoder and decoder share weights from timestep to timestep. The R5T1 model is an experiment to evaluate the benefits of more context frames. The N3T1 model is an unrolled model that trains independent weights per timestep, which should theoretically improve performance at the cost of more parameters.
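The training-versus-validation rollout shared by these models can be sketched as a sliding-window loop. This is an illustrative skeleton, not the authors' implementation: `model_fn` stands in for the autoencoder (a single shared network for R3T1/R5T1, or one network per timestep for N3T1), and the teacher-forcing switch reflects the abstract's description of using ground-truth frames at training time and fed-back predictions at validation time.

```python
import numpy as np

def rollout(model_fn, context, n_future, teacher_frames=None):
    """Predict n_future frames from a 3-frame context window.

    model_fn(window) -> next frame, where window has shape (H, W, 3).
    If teacher_frames is given (training), ground-truth frames refresh
    the window; otherwise (validation) predictions are fed back.
    """
    window = list(context)            # three context frames, oldest first
    preds = []
    for t in range(n_future):
        nxt = model_fn(np.stack(window, axis=-1))
        preds.append(nxt)
        feed = teacher_frames[t] if teacher_frames is not None else nxt
        window = window[1:] + [feed]  # slide the 3-frame window forward
    return preds
```

For the unrolled N3T1 variant, `model_fn` would be replaced by a list of per-timestep networks indexed by `t` inside the loop.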

1. Feng X, Meyer C. Accelerating cardiac dynamic imaging with video prediction using deep predictive coding networks. ISMRM. 2018.
2. Srivastava N, Mansimov E, Salakhutdinov R. Unsupervised learning of video representations using LSTMs. Proceedings of International Conference on Machine Learning (ICML). 2015.
3. Zhang T, Cheng JY, Potnick AG, et al. Fast pediatric 3D free-breathing abdominal dynamic contrast-enhanced MRI with high spatiotemporal resolution. J Magn Reson Imaging. 2015;41:460-473.
4. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs.LG]. 2015.

Figure 1. The encoder-decoder architecture used to generate predictions at each timestep. The encoder is based on the VGG16 architecture without fully-connected layers and produces the (N,6,7,1) embedding. The decoder deconvolves and upsamples encodings, using skip connections to improve structural preservation.

Figure 2. The top set is the R3T1 model, the middle set is the R5T1 model, and the bottom set is the N3T1 model, all at training time. In each set, the top row is the ground truth, the middle row is the output of the model, and the bottom row is the difference images. Since training data were randomly loaded, the same time series cannot be compared across models, but the difference images show that all three perform well during training.

Figure 3. Results for two slices at validation time, one on the top row and one on the bottom. From left to right: ground truth, R3T1, R5T1, N3T1. The animation shows how each changes over time. All three deep predictive models perform well at first, but they slowly deviate from the ground truth. The N3T1 model preserves structure best, since it has independently trained autoencoders for each timestep.

Figure 4. The contrast enhancement curve for the liver, with selected, segmented timesteps shown. The liver was manually segmented and the average signal was calculated at each timestep. Qualitatively, the shape of the curve for N3T1 looks best, albeit offset by a constant value.

Figure 5. The validation and test MSE plotted over time for each model. This shows, quantitatively, that N3T1 was the best model.