Weakly supervised classification of aort

Dataset

From 2006 to 2010, the UK Biobank recruited 502,638 participants aged 37–73 years in an effort to create a comprehensive, publicly available health-targeted dataset. The initial release of UK Biobank imaging data includes cardiac MRI sequences for 14,328 subjects²¹, including eight cardiac imaging sets. Three sequences of phase-contrast MRI images of the aortic valve registered in an en face view at the sinotubular junction were obtained. Figure 4 shows example MRI videos in all encodings: raw anatomical images (CINE); magnitude (MAG); and velocity encoded (VENC)²². See Supplementary Movies 1–6 for video examples. In MAG and VENC series, pixel intensity directly maps to velocity of blood flow. This is performed by exploiting the difference in phase of the transverse magnetism of protons within blood when flowing parallel to a gradient magnetic field, where the phase difference is proportional to velocity. CINE images encode anatomical information without capturing blood flow. All raw phase contrast MRI sequences are 30 frames, 12-bit grayscale color, and 192 × 192 pixels.

Studies using the UK Biobank are exempt from approval by the Stanford University School of Medicine Institutional Review Board as the data is de-identified and publicly available. Informed consent for use of health information and imaging was performed by the UK Biobank organization at the time of participant enrollment. The UK Biobank ethics committee administered the consent and regulatory compliance for all research participants²³. The collection, distribution, and use of UK Biobank data for non-commercial research purposes is compliant with all relevant regulations including European Union General Data Protection Regulation.

MRI preprocessing

All MRIs were preprocessed to: (1) localize the aortic valve to a 32 × 32 crop image size; and (2) align all image frames by peak blood flow in the cardiac cycle. Since the MAG series directly captures blood flow—and the aorta typically has the most blood flow—both of these steps are straightforward using standard threshold-based image processing techniques when the series is localized to a cross-sectional plane at the sinotubular junction. Selecting the pixel region with maximum standard deviation across all frames localized the aorta, and selecting the frame with maximum standard deviation identified peak blood flow. See Fig. 5 and Supplementary Methods for implementation details. Both heuristics were very accurate (>95% as evaluated on the development set) and selecting a ±7 frame window around the peak frame f_peak captured 99.5% of all pixel variation for the aorta. All three MRI sequences were aligned to this peak before classification.

Gold standard annotations

Gold standard labels were created for 412 patients (12,360 individual MRI frames), with each patient labeled as BAV or TAV, i.e., having two vs. the normal three aortic valve leaflets. We focus our analysis on BAV as it is the easiest malformation to identify from this MRI view. Total annotations included: a development set (100 TAV and 6 BAV patients) for writing labeling functions; a validation set (208 TAV and 8 BAV patients) for model hyperparameter tuning; and a held-out test set (87 TAV and 3 BAV patients) for final evaluation. The development set was selected via chart review of administrative codes (ICD9, ICD10, or OPCS-4) consistent with BAV and followed by manual annotation. The validation and test sets were sampled at random with uniform probability from all 14,328 MRI sequences to capture the BAV class distribution expected at test time. Development and validation set MRIs were annotated by a single cardiologist (J.R.P.). All test set MRIs were annotated by 3 cardiologists (J.R.P., H.C., S.M.) and final labels were assigned based on a majority vote across annotators. For inter-annotator agreement on the test set, Fleiss’s Kappa statistic was 0.354. This reflects a fair level of agreement amongst annotators given the difficulty of the task^24,25. Test data was withheld during all aspects of model development and used solely for the final model evaluation.

Weak supervision

There is considerable research on using indirect or weak supervision to train machine learning models for image and natural language tasks without relying entirely on manually labeled data^9,26,27. One longstanding approach is distant supervision^28,29, where indirect sources of labels are used to generate noisy training instances from unlabeled data. For example, in the ChestX-ray8 dataset³⁰ disorder labels were extracted from clinical assessments found in radiology reports. Unfortunately, we often lack access to indirect labeling resources or, as in the case of BAV, the class of interest itself may be rare and underdiagnosed in existing medical records. Another strategy is to generate noisy labels via crowdsourcing^31,32, which in some medical imaging tasks can perform as well as trained experts^33,34. In practice, however, crowdsourcing is logistically difficult when working with protected health information such as MRIs. A significant challenge in all weakly supervised approaches is correcting for label noise, which can negatively impact end model performance. Noise is commonly addressed using rule-based and generative modeling strategies for estimating the accuracy of label sources^35,36.

In this work, we use the recently proposed data programming⁹ method, a generalization of distant supervision that uses a factor graph-based model to learn both the unobserved accuracies of labeling sources and statistical dependencies between those sources^11,12. In this approach, source accuracy and dependencies are estimated without requiring labeled data, enabling the use of weaker forms of supervision to generate training data, such as using noisy heuristics from clinical experts. For example, in BAV patients the phase-contrast imaging of flow through the aortic valve has a distinct ellipse or asymmetrical triangle appearance compared to the more circular aorta in TAV patients. This is the reasoning a human might apply when directly examining an MRI. In data programming these types of broad, often imperfect domain insights are encoded into functions that vote on the potential class label of unlabeled data points. This allows us to weakly supervise tasks where indirect label sources, such as patient notes with assessments of BAV, are not available.

The idea of encoding domain insights is formalized as labeling functions—black box functions which vote on unlabeled data points. Labeling function output is used to learn a probabilistic label model of the underlying annotation process, where each labeling function is weighted by its estimated accuracy to generate probabilistic training labels y_i ∈ [0, 1]. These probabilistically labeled data are then used to train an off-the-shelf discriminative model such as a deep neural network. The only restriction on labeling functions is that they vote correctly with probability better than random chance. In images, labeling functions are defined over a set of domain features or primitives, semantic abstractions over raw pixel data that enable experts to more easily encode domain heuristics. Primitives encompass a wide range of abstractions, from simple shape features to complex semantic objects such as anatomical segmentation masks. Critically, the final discriminative model learns features from the original MRI sequence, rather than the restricted space of primitives used by labeling functions. This allows the model to generalize beyond the heuristics encoded in labeling functions.

Patient MRIs are represented as a collection of m frames X = {x₁, …, x_m}, where each frame x_i is a 32 × 32 image with MAG, CINE, and VENC encodings mapped to color channels. Each frame is modeled as an unlabeled data point x_i and latent random variable y_i ∈ [−1, 1], corresponding to the true (unobserved) frame label. Supervision is provided as a set of n labeling functions λ₁, …, λ_n that define a mapping λ_j:x_i → Λ_ij where Λ_i1, …, Λ_in is the vector of labeling function votes for sample i. In binary classification, Λ_ij is in the domain {−1, 0, 1}, i.e., {false, abstain, true}, resulting in a label matrix Λ ∈ {−1, 0, 1}^m×n.

The relationship between unobserved labels y and the label matrix Λ is modeled using a factor graph³⁷. We learn a probabilistic model that best explains Λ, i.e., the matrix observed by applying labeling functions to unlabeled data. When labeling function outputs are conditionally independent given the true label, this model consists of n accuracy factors between λ₁, …, λ_n and y

$$\phi _j^{Acc}({\mathrm{\Lambda }}_i,y_i): = y_i{\mathrm{\Lambda }}_{ij}$$

(1)

$$p_\theta ({\bf{\Lambda }},{\bf{Y}}) \propto {\mathrm{exp}}\left( {\mathop {\sum }\limits_{i = 1}^m \mathop {\sum }\limits_{j = 1}^n \theta _j^{Acc}\phi _j^{Acc}({\mathrm{\Lambda }}_i,y_i)} \right)$$

(2)

where Y := y_i, …, y_m, our true labels. The model’s weights θ are estimated by minimizing the negative log likelihood of p_θ(Λ) using contrastive divergence³⁸. Optimization is done using standard stochastic gradient descent with Gibbs sampling for gradient estimation.

In many settings, we encounter statistical dependencies among labeling functions. These dependencies are included in the model by defining additional factors

$$p_\theta ({\bf{\Lambda }},{\bf{Y}}) \propto {\mathrm{exp}}\left( {\mathop {\sum }\limits_{i = 1}^m \mathop {\sum }\limits_{t \in T} \mathop {\sum }\limits_{s \in S_t} \theta _s^t\phi _s^t({\mathrm{\Lambda }}_i,y_i)} \right)$$

(3)

where t ∈ T is a dependency type and S_t are the labeling functions that participate in t. These dependencies may be specified manually if known or learned from unlabeled data.

Automatically learning dependencies from unlabeled data is important in weakly supervised imaging tasks where labeling functions interact with a small set of primitives and have higher order dependency structure. For example, a labeling function defined using the ratio of area and perimeter has dependencies with separate labeling functions for area and perimeter. By expressing supervision using a small space of primitives, we can rely on the Coral method¹¹ to statically analyze labeling function source code and automatically infer complex dependencies among labeling functions based on which primitives they use as input.

The final weak supervision pipeline requires two inputs: (1) primitive feature matrix; and (2) observed label matrix Λ. For generating Λ, we take each patient’s frame sequence $\overline {\bf{x}} _i = \{ x_{1i},...x_{30i}\}$ and apply labeling functions to a window of t frames $\{ x_{({\mathrm{f}}_{{\mathrm{peak}}} - t/2)i},...,x_{({\mathrm{f}}_{{\mathrm{peak}}} + t/2)i}\}$ centered on f_peak, i.e., the frame mapping to peak blood flow. Here t = 6 performed best in our label model experiments. The output of the label model is a set of per frame probabilistic labels {y₁, …, y_(t×N)} where N is the number of patients. To compute a single, per patient probabilistic label, $\bar y_i$, we assign the mean probability of all t patient frames if mean({y_1i, …, y_ti}) > 0.9 and the minimum probability if min({y_1i, …, y_ti}) < 0.5. Patient MRIs that did not meet these thresholds (7%; 304/4543) were removed from the final weak label set. The final weakly labeled training set consists of each 30 frame MRI sequence and a single probabilistic label per-patient: $\widehat {\bf{X}} = \{ \overline {\bf{x}} _i, \ldots ,\overline {\bf{x}} _N\}$ and $\widehat {\bf{Y}} = \{ \bar y_i,...,\bar y_N\}$.

Primitives are generated using existing models or methods for extracting features from image data. In our setting, we restricted primitives to unsupervised shape statistics and pixel intensity features provided by off-the-shelf image analysis tools such as scikit-image³⁹. Primitives are generated using a binarized mask of the aortic valve for each frame in a patient’s MAG series. Since the label model accounts for noise in labeling functions and primitives, we can use unsupervised thresholding techniques such as Otsu’s method⁴⁰ to generate binary masks. All masks were used to compute primitives for: (1) area; (2) perimeter; (3) eccentricity (a [0,1) measure comparing the mask shape to an ellipse, where 0 indicates a perfect circle); (4) pixel intensity (the mean pixel value for the entire mask); and (5) ratio (the ratio of area over perimeter squared). Since the size of the heart and anatomical structures correlate strongly with patient sex, we normalized these features by two population means stratified by sex in the unlabeled set.

We designed 5 labeling functions using the primitives described above. For model simplicity, labeling functions were restricted to using threshold-based, frame-level information for voting. All labeling function thresholds were selected manually using distributional statistics computed over all primitives for the expert-labeled development set. See Supplementary Fig. 2 for the complete development set used for labeling function design and Supplementary Table 1 for labeling function implementations. The final weak supervision pipeline is shown in Fig. 6.

The discriminative model classifies BAV/TAV status using t contiguous MRI frames (5 ≤ t ≤ 30, where t is a hyperparameter) and a single probabilistic label per patient. This model consists of two components: a frame encoder for learning frame-level features and a sequence encoder for combining individual frames into a single feature vector. For the frame encoder, we use a Dense Convolutional Network (DenseNet)⁴¹ with 40 layers and a growth rate of 12, pretrained on 50,000 images from CIFAR-10⁴². We tested two other common pretrained image neural networks (VGG16⁴³, ResNet-50⁴⁴), but found that a DenseNet40-12 model performed best overall, aligning with the previous reports⁴¹. The DenseNet architecture takes advantage of low-level feature maps at all layers, making it well-suited for medical imaging applications where low-level features (e.g., edge detectors) often carry substantial explanatory power.

For the sequence encoder, we used a Bidirectional Long Short-term Memory (LSTM)⁴⁵ sequence model with soft attention⁴⁶ to combine all MRI frame features. The soft attention layer optimizes the weighted mean of frame features, allowing the network to automatically give more weight to the most informative frames in an MRI sequence. We explored simpler feature pooling architectures (e.g., mean/max pooling), but each of these methods was outperformed by the LSTM in our experiments. The final hybrid CNN-LSTM architecture aligns with recent methods for state-of-the-art video classification^47,48 and 3D medical imaging⁴⁹.

The CNN-LSTM model is trained using noise-aware binary cross entropy loss L:

$$\hat w = argmin_w\frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {\Bbb E}_{y\sim \hat Y}\left[ {L(w,x_i,y)} \right]$$

(4)

This is analogous to standard supervised learning loss, except we are now minimizing the expected value with respect to $\hat Y$⁹. This loss enables the discriminative model to take advantage of the more informative probabilistic labels produced by the label model, i.e., training instances with higher probability have more impact on the learned model. Figure 7 shows the complete discriminative model pipeline.

Training and hyperparameter tuning

The development set was used to write all labeling functions and the validation set was used for all model hyperparameter tuning. All models were evaluated with and without data augmentation. Data augmentation is used in deep learning models to increase training set sizes and encode known invariances into the final model by creating transformed copies of existing samples. For example, BAV/TAV status does not change under translation, so generating additional shifted MRI training images does not change the class label, but does improve final classification performance. We used a combination of crops and affine transformations commonly used by state-of-the-art image classifiers⁵⁰. We tested models using all 3 MRI series (CINE, MAG, VENC with a single series per channel) and models using only the MAG series. The MAG series performed best, so we only report those results here.

Hyperparameters were tuned for L2 penalty, dropout, learning rate, and the feature vector size of our last hidden layer before classification. Augmentation hyperparameters were tuned to determine final translation, rotation, and scaling ranges. All models use validation-based early stopping with F1 score as the stopping criterion. The probability threshold for classification was tuned using the validation set for each run to address known calibration issues when using deep learning models⁵¹. Architectures were tuned using a random grid search over 10 models for non-augmented data and 24 for augmented data. See Supplementary Table 2 for full parameter grid settings.

Evaluation metrics

Classification models were evaluated using positive predictive value (precision), sensitivity (recall), F1 score (i.e., the harmonic mean of precision and recall), and AUROC. Due to the extreme class imbalance of this task we also report discounted cumulative gain (DCG) to capture the overall ranking quality of model predictions⁵². Each BAV or TAV case was assigned a relevance weight r of 1 or 0, respectively, and test set patients were ranked by their predicted probabilities. DCG is computed as $\mathop {\sum}\nolimits_i^p {\frac{{r_i}}{{{\mathrm{log}}_2(i + 1)}}}$ where p is the total number of instances and i is the corresponding rank. This score is normalized by the DCG score of a perfect ranking (i.e., all true BAV cases in the top ranked results) to compute normalized DCG (NDCG) in the range [0.0, 1.0]. Higher NDCG scores indicate that the model does a better job of ranking BAV cases higher than TAV cases. All scores were computed using test set data, using the best performing models found during grid search, and reported as the mean and 95% confidence intervals of 5 different random model weight initializations.

For labeling functions, we report two additional metrics: coverage (Eq. (5)) a measure of how many data points a labeling function votes {−1, 1} on; and conflict (Eq. (6)) the percentage of data points where a labeling function disagrees with one or more other labeling functions.

$${\mathrm{coverage}}_{\lambda _j} = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {1}\left( {\lambda _j(x_i) \in \{ - 1,1\} } \right)$$

(5)

$${\mathrm{conflict}}_{\lambda _j} = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {1}\left( {\mathop {\sum }\limits_{k \ne j}^{\lambda _n} {1}\left( {\lambda _j(x_i) \in \{ - 1,1\} \wedge \lambda _j(x_i) \ne \lambda _k(x_i)} \right)} \right) > 0$$

(6)

Model evaluation with clinical outcomes data

To construct a real-world validation strategy dependent upon the accuracy of image classification but completely independent of the imaging data input, we used model-derived classifications (TAV vs. BAV) as a predictor of validated cardiovascular outcomes using standard epidemiological methods. For 9230 patients with prospectively obtained MRI imaging who were excluded from any aspect of model construction, validation, or testing we performed an ensemble classification with the best performing CNN-LSTM model.

For evaluation we assembled a standard composite outcome of major adverse cardiovascular events (MACE). Phenotypes for MACE were inclusive of the first occurrence of coronary artery disease (myocardial infarction, percutaneous coronary intervention, coronary artery bypass grafting), ischemic stroke (inclusive of transient ischemic attack), heart failure, or atrial fibrillation. These were defined using ICD-9, ICD-10, and OPCS-4 codes from available hospital encounter, death registry, and self-reported survey data of all 500,000 participants of the UK Biobank at enrollment similar to previously reported methods⁵³.

Starting 10 years prior to enrollment in the study, median follow up time for the participants with MRI data included in the analysis was 19 years with a maximum of 22 years. For survival analysis, we employed the “survival” and “survminer” packages in R version 3.4.3, using aortic valve classification as the predictor and time-to-MACE as the outcome, with model evaluation by a simple log-rank test.

To verify the accuracy of the CNN-LSTM’s predicted labels, we generated 2 subsets of our model’s predictions for manual review: (1) 36 randomly chosen MRI sequences (18 TAV and 18 BAV patients); and (3) 100 positive BAV predictions, binned into quartiles by predicted probability. All MRIs were reviewed and labeled by a single annotator (J.R.P.). The output of the last hidden layer was visualized using a t-distributed stochastic neighbor embedding (t-SNE)⁵⁴ plot to assist error analysis.

Related work

In medical imaging, weak supervision refers to a broad range of techniques using limited, indirect, or noisy labels. Multiple instance learning (MIL) is one common weak supervision approach in medical images⁵⁵. MIL approaches assume a label is defined over a bag of unlabeled instances, such as an image-level label being used to supervise a segmentation task. Xu et al.⁵⁶ simultaneously performed binary classification and segmentation for histopathology images using a variant of MIL, where image-level labels are used to supervise both image classification and a segmentation subtask. ChestX-ray8³⁰ was used in Li et al.⁵⁷ to jointly perform classification and localization using a small number of weakly labeled examples. Patient radiology reports and other medical record data are frequently used to generate noisy labels for imaging tasks^30,58,59,60.

Weak supervision shares similarities with semi-supervised learning⁶¹, which enables training models using a small labeled dataset combined with large, unlabeled data. The primary difference is how the structure of unlabeled data is specified in the model. In semi-supervised learning, we make smoothness assumptions and extract insights on structure directly from unlabeled data using task-agnostic properties such as distance metrics and entropy constraints⁶². Weak supervision, in contrast, relies on directly injecting domain knowledge into the model to incorporate the underlying structure of unlabeled data. In many cases, these sources of domain knowledge are readily available in existing knowledge bases, indirectly-labeled data like patient notes, or weak classification models and heuristics.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences | Nature Communications