Dataset
From 2006 to 2010, the UK Biobank recruited 502,638 participants aged 37–73 years in an effort to create a comprehensive, publicly available health-targeted dataset. The initial release of UK Biobank imaging data includes cardiac MRI sequences for 14,328 subjects21, including eight cardiac imaging sets. Three sequences of phase-contrast MRI images of the aortic valve registered in an en face view at the sinotubular junction were obtained. Figure 4 shows example MRI videos in all encodings: raw anatomical images (CINE); magnitude (MAG); and velocity encoded (VENC)22. See Supplementary Movies 1–6 for video examples. In MAG and VENC series, pixel intensity directly maps to velocity of blood flow. This encoding exploits the difference in phase of the transverse magnetization of protons within blood when flowing parallel to a gradient magnetic field, where the phase difference is proportional to velocity. CINE images encode anatomical information without capturing blood flow. All raw phase contrast MRI sequences are 30 frames, 12-bit grayscale color, and 192 × 192 pixels.

Fig. 4 Example MRI sequence data for BAV and TAV subjects. a Uncropped MRI frames for CINE, MAG, and VENC series in an oblique coronal view of the thorax centered upon an en face view of the aortic valve at the sinotubular junction (red boxes). b 15-frame subsequence of a phase-contrast MRI for all series, with peak frame outlined in blue. MAG frames at peak flow for 12 patients, broken down by class: c bicuspid aortic valve (BAV) and d tricuspid aortic valve (TAV)
Studies using the UK Biobank are exempt from approval by the Stanford University School of Medicine Institutional Review Board as the data is de-identified and publicly available. Informed consent for use of health information and imaging was performed by the UK Biobank organization at the time of participant enrollment. The UK Biobank ethics committee administered the consent and regulatory compliance for all research participants23. The collection, distribution, and use of UK Biobank data for non-commercial research purposes is compliant with all relevant regulations including European Union General Data Protection Regulation.
MRI preprocessing
All MRIs were preprocessed to: (1) localize the aortic valve to a 32 × 32 crop image size; and (2) align all image frames by peak blood flow in the cardiac cycle. Since the MAG series directly captures blood flow—and the aorta typically has the most blood flow—both of these steps are straightforward using standard threshold-based image processing techniques when the series is localized to a cross-sectional plane at the sinotubular junction. Selecting the pixel region with maximum standard deviation across all frames localized the aorta, and selecting the frame with maximum standard deviation identified peak blood flow. See Fig. 5 and Supplementary Methods for implementation details. Both heuristics were very accurate (>95% as evaluated on the development set) and selecting a ±7 frame window around the peak frame fpeak captured 99.5% of all pixel variation for the aorta. All three MRI sequences were aligned to this peak before classification.
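The two heuristics can be illustrated with a short NumPy sketch. It is not the authors' implementation (which is described in the Supplementary Methods). The function name `localize_and_align`, the sliding-window search over summed per-pixel standard deviation, and the use of within-crop standard deviation to pick the peak frame are all assumptions made for illustration.

```python
import numpy as np

def localize_and_align(mag, crop=32, half_window=7):
    """Sketch: crop the most pulsatile region and find the peak-flow frame.

    mag: array of shape (frames, height, width) holding one MAG series.
    """
    mag = np.asarray(mag, dtype=float)
    n_frames = mag.shape[0]
    # Per-pixel standard deviation over time highlights pulsatile blood flow.
    pixel_std = mag.std(axis=0)
    # Sum pixel_std over every crop x crop window using an integral image.
    integral = np.pad(pixel_std, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    window_sum = (integral[crop:, crop:] - integral[:-crop, crop:]
                  - integral[crop:, :-crop] + integral[:-crop, :-crop])
    top, left = np.unravel_index(np.argmax(window_sum), window_sum.shape)
    cropped = mag[:, top:top + crop, left:left + crop]
    # Assumed peak criterion: the frame whose cropped pixels vary the most.
    f_peak = int(np.argmax(cropped.std(axis=(1, 2))))
    lo = max(0, f_peak - half_window)
    hi = min(n_frames, f_peak + half_window + 1)
    return (int(top), int(left)), f_peak, cropped[lo:hi]
```

The returned window is clipped at the sequence boundaries, so a peak near the first or last frame yields fewer than 15 frames.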

Fig. 5 Aorta localization. a Uncropped MAG series MRI frame, showing 0–1 normalized, per pixel standard deviation. b Green box is a zoom of the heart region and the red box corresponds to the aorta, the highest weighted pixel area in the image. c A box and whisker plot of per-frame standard deviations for all 4239 MRI sequences in the weak training set. Here the blue box spans the interquartile range (first to third quartile), the black line is the median value, and the whiskers map to the minimum and maximum values across all frames at a given index. Note the most variation occurs in the first 15 frames
Gold standard annotations
Gold standard labels were created for 412 patients (12,360 individual MRI frames), with each patient labeled as BAV or TAV, i.e., having two vs. the normal three aortic valve leaflets. We focus our analysis on BAV as it is the easiest malformation to identify from this MRI view. Total annotations included: a development set (100 TAV and 6 BAV patients) for writing labeling functions; a validation set (208 TAV and 8 BAV patients) for model hyperparameter tuning; and a held-out test set (87 TAV and 3 BAV patients) for final evaluation. The development set was selected via chart review of administrative codes (ICD9, ICD10, or OPCS-4) consistent with BAV and followed by manual annotation. The validation and test sets were sampled at random with uniform probability from all 14,328 MRI sequences to capture the BAV class distribution expected at test time. Development and validation set MRIs were annotated by a single cardiologist (J.R.P.). All test set MRIs were annotated by 3 cardiologists (J.R.P., H.C., S.M.) and final labels were assigned based on a majority vote across annotators. For inter-annotator agreement on the test set, Fleiss’s Kappa statistic was 0.354. This reflects a fair level of agreement amongst annotators given the difficulty of the task24,25. Test data was withheld during all aspects of model development and used solely for the final model evaluation.
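For readers unfamiliar with the agreement statistic, the following sketch computes Fleiss's kappa from a table of per-patient vote counts. The function name `fleiss_kappa` and the toy counts are illustrative assumptions and are not the study's annotations.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's kappa for a (subjects x categories) table of rater counts.

    Every row must sum to the same number of raters (3 in the test set above).
    """
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    # Overall proportion of votes falling in each category.
    p_cat = counts.sum(axis=0) / counts.sum()
    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_subj.mean()           # observed agreement
    p_exp = (p_cat ** 2).sum()      # agreement expected by chance
    return float((p_bar - p_exp) / (1.0 - p_exp))

# Toy example: 4 patients, 3 raters, columns are (TAV votes, BAV votes).
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```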
Weak supervision
There is considerable research on using indirect or weak supervision to train machine learning models for image and natural language tasks without relying entirely on manually labeled data9,26,27. One longstanding approach is distant supervision28,29, where indirect sources of labels are used to generate noisy training instances from unlabeled data. For example, in the ChestX-ray8 dataset30 disorder labels were extracted from clinical assessments found in radiology reports. Unfortunately, we often lack access to indirect labeling resources or, as in the case of BAV, the class of interest itself may be rare and underdiagnosed in existing medical records. Another strategy is to generate noisy labels via crowdsourcing31,32, which in some medical imaging tasks can perform as well as trained experts33,34. In practice, however, crowdsourcing is logistically difficult when working with protected health information such as MRIs. A significant challenge in all weakly supervised approaches is correcting for label noise, which can negatively impact end model performance. Noise is commonly addressed using rule-based and generative modeling strategies for estimating the accuracy of label sources35,36.
In this work, we use the recently proposed data programming9 method, a generalization of distant supervision that uses a factor graph-based model to learn both the unobserved accuracies of labeling sources and statistical dependencies between those sources11,12. In this approach, source accuracy and dependencies are estimated without requiring labeled data, enabling the use of weaker forms of supervision to generate training data, such as using noisy heuristics from clinical experts. For example, in BAV patients the phase-contrast imaging of flow through the aortic valve has a distinct ellipse or asymmetrical triangle appearance compared to the more circular aorta in TAV patients. This is the reasoning a human might apply when directly examining an MRI. In data programming these types of broad, often imperfect domain insights are encoded into functions that vote on the potential class label of unlabeled data points. This allows us to weakly supervise tasks where indirect label sources, such as patient notes with assessments of BAV, are not available.
The idea of encoding domain insights is formalized as labeling functions—black box functions which vote on unlabeled data points. Labeling function output is used to learn a probabilistic label model of the underlying annotation process, where each labeling function is weighted by its estimated accuracy to generate probabilistic training labels yi ∈ [0, 1]. These probabilistically labeled data are then used to train an off-the-shelf discriminative model such as a deep neural network. The only restriction on labeling functions is that they vote correctly with probability better than random chance. In images, labeling functions are defined over a set of domain features or primitives, semantic abstractions over raw pixel data that enable experts to more easily encode domain heuristics. Primitives encompass a wide range of abstractions, from simple shape features to complex semantic objects such as anatomical segmentation masks. Critically, the final discriminative model learns features from the original MRI sequence, rather than the restricted space of primitives used by labeling functions. This allows the model to generalize beyond the heuristics encoded in labeling functions.
Patient MRIs are represented as a collection of m frames X = {x1, …, xm}, where each frame xi is a 32 × 32 image with MAG, CINE, and VENC encodings mapped to color channels. Each frame is modeled as an unlabeled data point xi and latent random variable yi ∈ {−1, 1}, corresponding to the true (unobserved) frame label. Supervision is provided as a set of n labeling functions λ1, …, λn that define a mapping λj:xi → Λij where Λi1, …, Λin is the vector of labeling function votes for sample i. In binary classification, Λij is in the domain {−1, 0, 1}, i.e., {false, abstain, true}, resulting in a label matrix Λ ∈ {−1, 0, 1}^{m×n}.
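As a concrete illustration of how votes populate the label matrix, the sketch below applies two hypothetical threshold-based labeling functions to per-frame primitives. The function names, dictionary keys, and threshold values are invented for this example. The labeling functions actually used are listed in Supplementary Table 1.

```python
import numpy as np

BAV, ABSTAIN, TAV = 1, 0, -1  # votes in {-1, 0, 1}

def lf_eccentricity(p):
    # Hypothetical threshold: a strongly elongated mask votes BAV.
    return BAV if p["eccentricity"] > 0.7 else ABSTAIN

def lf_ratio(p):
    # Hypothetical thresholds on area / perimeter**2, which is largest
    # for a circle (1 / (4 * pi), about 0.08).
    if p["ratio"] > 0.07:
        return TAV
    if p["ratio"] < 0.05:
        return BAV
    return ABSTAIN

def apply_labeling_functions(primitives, lfs):
    """Return the m x n label matrix for m frames and n labeling functions."""
    return np.array([[lf(p) for lf in lfs] for p in primitives], dtype=int)

# Three toy frames described only by their primitives.
frames = [
    {"eccentricity": 0.9, "ratio": 0.04},
    {"eccentricity": 0.2, "ratio": 0.078},
    {"eccentricity": 0.5, "ratio": 0.06},
]
label_matrix = apply_labeling_functions(frames, [lf_eccentricity, lf_ratio])
```

Each row of `label_matrix` is the vote vector for one frame, and a zero entry records an abstention.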
The relationship between unobserved labels y and the label matrix Λ is modeled using a factor graph37. We learn a probabilistic model that best explains Λ, i.e., the matrix observed by applying labeling functions to unlabeled data. When labeling function outputs are conditionally independent given the true label, this model consists of n accuracy factors between λ1, …, λn and y
$$\phi _j^{Acc}({\mathrm{\Lambda }}_i,y_i): = y_i{\mathrm{\Lambda }}_{ij}$$
(1)
$$p_\theta ({\bf{\Lambda }},{\bf{Y}}) \propto {\mathrm{exp}}\left( {\mathop {\sum }\limits_{i = 1}^m \mathop {\sum }\limits_{j = 1}^n \theta _j^{Acc}\phi _j^{Acc}({\mathrm{\Lambda }}_i,y_i)} \right)$$
(2)
where Y := {y1, …, ym} are the true labels. The model’s weights θ are estimated by minimizing the negative log likelihood of pθ(Λ) using contrastive divergence38. Optimization is done using standard stochastic gradient descent with Gibbs sampling for gradient estimation.
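Once θ is known, Eqs. (1) and (2) imply a closed form for the probabilistic label of each frame in the conditionally independent case: since p(yi | Λi) ∝ exp(yi Σj θj Λij) with yi ∈ {−1, 1}, the posterior is P(yi = 1 | Λi) = σ(2 Σj θj Λij), where σ is the logistic function. The sketch below evaluates this expression. It does not estimate θ (no contrastive divergence is performed), and the weights shown are made-up values.

```python
import numpy as np

def posterior_positive(label_matrix, theta):
    """P(y_i = 1 | Lambda_i) under the accuracy-factor-only model of Eqs. (1)-(2)."""
    score = np.asarray(label_matrix, dtype=float) @ np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-2.0 * score))

# Toy label matrix (3 frames, 2 labeling functions) and assumed weights.
votes = [[1, 1], [0, 0], [-1, 1]]
theta = [1.0, 0.5]
probs = posterior_positive(votes, theta)
```

A frame on which every labeling function abstains receives probability 0.5, and a labeling function with a larger weight moves the posterior further when it votes.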
In many settings, we encounter statistical dependencies among labeling functions. These dependencies are included in the model by defining additional factors
$$p_\theta ({\bf{\Lambda }},{\bf{Y}}) \propto {\mathrm{exp}}\left( {\mathop {\sum }\limits_{i = 1}^m \mathop {\sum }\limits_{t \in T} \mathop {\sum }\limits_{s \in S_t} \theta _s^t\phi _s^t({\mathrm{\Lambda }}_i,y_i)} \right)$$
(3)
where t ∈ T is a dependency type and St are the labeling functions that participate in t. These dependencies may be specified manually if known or learned from unlabeled data.
Automatically learning dependencies from unlabeled data is important in weakly supervised imaging tasks where labeling functions interact with a small set of primitives and have higher order dependency structure. For example, a labeling function defined using the ratio of area and perimeter has dependencies with separate labeling functions for area and perimeter. By expressing supervision using a small space of primitives, we can rely on the Coral method11 to statically analyze labeling function source code and automatically infer complex dependencies among labeling functions based on which primitives they use as input.
The final weak supervision pipeline requires two inputs: (1) a primitive feature matrix; and (2) the observed label matrix Λ. For generating Λ, we take each patient’s frame sequence \(\overline {\bf{x}} _i = \{ x_{1i}, \ldots ,x_{30i}\}\) and apply labeling functions to a window of t frames \(\{ x_{({\mathrm{f}}_{{\mathrm{peak}}} - t/2)i}, \ldots ,x_{({\mathrm{f}}_{{\mathrm{peak}}} + t/2)i}\}\) centered on fpeak, i.e., the frame mapping to peak blood flow. Here t = 6 performed best in our label model experiments. The output of the label model is a set of per frame probabilistic labels {y1, …, y(t×N)} where N is the number of patients. To compute a single, per patient probabilistic label, \(\bar y_i\), we assign the mean probability of all t patient frames if mean({y1i, …, yti}) > 0.9 and the minimum probability if min({y1i, …, yti}) < 0.5. Patient MRIs that did not meet these thresholds (7%; 304/4543) were removed from the final weak label set. The final weakly labeled training set consists of each 30 frame MRI sequence and a single probabilistic label per-patient: \(\widehat {\bf{X}} = \{ \overline {\bf{x}} _1, \ldots ,\overline {\bf{x}} _N\}\) and \(\widehat {\bf{Y}} = \{ \bar y_1, \ldots ,\bar y_N\}\).
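The per-patient aggregation rule can be sketched as follows. The function name `patient_label` is illustrative, and checking the mean threshold before the minimum threshold is an assumption, since the text does not state which rule takes precedence when both hold.

```python
import numpy as np

def patient_label(frame_probs, hi=0.9, lo=0.5):
    """Collapse per-frame probabilities into one label, or None if dropped."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    if frame_probs.mean() > hi:
        # Confidently positive: keep the mean frame probability.
        return float(frame_probs.mean())
    if frame_probs.min() < lo:
        # At least one frame leans negative: keep the minimum probability.
        return float(frame_probs.min())
    # Neither threshold met: the patient is removed from the weak label set.
    return None
```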
Primitives are generated using existing models or methods for extracting features from image data. In our setting, we restricted primitives to unsupervised shape statistics and pixel intensity features provided by off-the-shelf image analysis tools such as scikit-image39. Primitives are generated using a binarized mask of the aortic valve for each frame in a patient’s MAG series. Since the label model accounts for noise in labeling functions and primitives, we can use unsupervised thresholding techniques such as Otsu’s method40 to generate binary masks. All masks were used to compute primitives for: (1) area; (2) perimeter; (3) eccentricity (a [0,1) measure comparing the mask shape to an ellipse, where 0 indicates a perfect circle); (4) pixel intensity (the mean pixel value for the entire mask); and (5) ratio (the ratio of area over perimeter squared). Since the size of the heart and anatomical structures correlate strongly with patient sex, we normalized these features by two population means stratified by sex in the unlabeled set.
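To make the primitives concrete, the sketch below thresholds one frame with Otsu's method and computes the five quantities using NumPy alone. It is an illustration under several assumptions: the whole thresholded mask is treated as the valve (no connected-component selection), the perimeter is a simple count of exposed pixel edges, which will differ from the estimator in scikit-image, and sex-stratified normalization is omitted. The function names are hypothetical.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Otsu's method: choose the threshold maximizing between-class variance."""
    hist, edges = np.histogram(image, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                    # pixel count at or below each bin
    w1 = np.cumsum(hist[::-1])[::-1]        # pixel count at or above each bin
    m0 = np.cumsum(hist * centers) / np.maximum(w0, 1)
    m1 = (np.cumsum((hist * centers)[::-1]) / np.maximum(w1[::-1], 1))[::-1]
    between = w0[:-1] * w1[1:] * (m0[:-1] - m1[1:]) ** 2
    return centers[np.argmax(between)]

def shape_primitives(frame):
    """Area, perimeter, eccentricity, mean intensity, and area / perimeter**2."""
    frame = np.asarray(frame, dtype=float)
    mask = frame > otsu_threshold(frame)
    area = int(mask.sum())
    if area < 2:
        raise ValueError("mask too small to compute shape statistics")
    # Perimeter: number of pixel edges between mask and background.
    padded = np.pad(mask, 1).astype(int)
    perimeter = int(np.abs(np.diff(padded, axis=0)).sum()
                    + np.abs(np.diff(padded, axis=1)).sum())
    # Eccentricity of the ellipse sharing the mask's second moments.
    rows, cols = np.nonzero(mask)
    minor, major = np.linalg.eigvalsh(np.cov(np.stack([rows, cols])))
    ecc = float(np.sqrt(max(0.0, 1.0 - minor / major))) if major > 0 else 0.0
    return {"area": area, "perimeter": perimeter, "eccentricity": ecc,
            "intensity": float(frame[mask].mean()),
            "ratio": area / perimeter ** 2}
```

An eccentricity near 0 corresponds to a roughly circular mask, while values approaching 1 indicate an elongated one.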
We designed 5 labeling functions using the primitives described above. For model simplicity, labeling functions were restricted to using threshold-based, frame-level information for voting. All labeling function thresholds were selected manually using distributional statistics computed over all primitives for the expert-labeled development set. See Supplementary Fig. 2 for the complete development set used for labeling function design and Supplementary Table 1 for labeling function implementations. The final weak supervision pipeline is shown in Fig. 6.

Fig. 6 Weak supervision workflow. Pipeline for probabilistic training label generation based on user-defined primitives and labeling functions. Primitives and labeling functions (step 1) are used to weakly supervise the BAV classification task and programmatically generate probabilistic training data from large collections of unlabeled MRI sequences (step 2), which are then used to train a noise-aware deep learning classification model (step 3)
The discriminative model classifies BAV/TAV status using t contiguous MRI frames (5 ≤ t ≤ 30, where t is a hyperparameter) and a single probabilistic label per patient. This model consists of two components: a frame encoder for learning frame-level features and a sequence encoder for combining individual frames into a single feature vector. For the frame encoder, we use a Dense Convolutional Network (DenseNet)41 with 40 layers and a growth rate of 12, pretrained on 50,000 images from CIFAR-1042. We tested two other common pretrained image neural networks (VGG1643, ResNet-5044), but found that a DenseNet40-12 model performed best overall, aligning with the previous reports41. The DenseNet architecture takes advantage of low-level feature maps at all layers, making it well-suited for medical imaging applications where low-level features (e.g., edge detectors) often carry substantial explanatory power.
For the sequence encoder, we used a Bidirectional Long Short-term Memory (LSTM)45 sequence model with soft attention46 to combine all MRI frame features. The soft attention layer optimizes the weighted mean of frame features, allowing the network to automatically give more weight to the most informative frames in an MRI sequence. We explored simpler feature pooling architectures (e.g., mean/max pooling), but each of these methods was outperformed by the LSTM in our experiments. The final hybrid CNN-LSTM architecture aligns with recent methods for state-of-the-art video classification47,48 and 3D medical imaging49.
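The weighted-mean pooling performed by a soft attention layer can be sketched in a few lines. This is a generic formulation, not the exact layer used here: scoring each frame by a dot product with a single vector `v` is an assumption, and in a trained network `v` would be learned and the inputs would be LSTM hidden states rather than raw features.

```python
import numpy as np

def soft_attention_pool(frame_features, v):
    """Return the attention-weighted mean of frame features and the weights.

    frame_features: (frames, dim) array; v: (dim,) scoring vector.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    scores = frame_features @ np.asarray(v, dtype=float)
    # Softmax over frames, shifted by the max score for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ frame_features, weights
```

With a zero scoring vector the weights are uniform and the result reduces to mean pooling, one of the simpler alternatives mentioned above.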
The CNN-LSTM model is trained using noise-aware binary cross entropy loss L:
$$\hat w = argmin_w\frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {\Bbb E}_{y\sim \hat Y}\left[ {L(w,x_i,y)} \right]$$
(4)
This is analogous to standard supervised learning loss, except we are now minimizing the expected value with respect to \(\hat Y\)9. This loss enables the discriminative model to take advantage of the more informative probabilistic labels produced by the label model, i.e., training instances with higher probability have more impact on the learned model. Figure 7 shows the complete discriminative model pipeline.
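For binary cross entropy, the expectation in Eq. (4) has a simple closed form: with probabilistic label ŷ and predicted probability p, the expected loss is −[ŷ log p + (1 − ŷ) log(1 − p)]. A minimal NumPy sketch follows. The function name and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def noise_aware_bce(pred, soft_label, eps=1e-7):
    """Expected binary cross entropy under probabilistic labels in [0, 1]."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    soft_label = np.asarray(soft_label, dtype=float)
    per_example = -(soft_label * np.log(pred)
                    + (1 - soft_label) * np.log(1 - pred))
    return float(per_example.mean())
```

When every label is exactly 0 or 1 this reduces to the standard cross entropy, and for a fixed soft label the loss is minimized when the prediction equals that label.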

Fig. 7 Deep neural network for MRI sequence classification. Each MRI frame is encoded by the DenseNet into a feature vector fxi. These frame features are fed in sequentially to the LSTM sequence encoder, which uses a soft attention layer to learn a weighted mean embedding of all frames Semb. This forms the final feature vector used for binary classification
Training and hyperparameter tuning
The development set was used to write all labeling functions and the validation set was used for all model hyperparameter tuning. All models were evaluated with and without data augmentation. Data augmentation is used in deep learning models to increase training set sizes and encode known invariances into the final model by creating transformed copies of existing samples. For example, BAV/TAV status does not change under translation, so generating additional shifted MRI training images does not change the class label, but does improve final classification performance. We used a combination of crops and affine transformations commonly used by state-of-the-art image classifiers50. We tested models using all 3 MRI series (CINE, MAG, VENC with a single series per channel) and models using only the MAG series. The MAG series performed best, so we only report those results here.
Hyperparameters were tuned for L2 penalty, dropout, learning rate, and the feature vector size of our last hidden layer before classification. Augmentation hyperparameters were tuned to determine final translation, rotation, and scaling ranges. All models use validation-based early stopping with F1 score as the stopping criterion. The probability threshold for classification was tuned using the validation set for each run to address known calibration issues when using deep learning models51. Architectures were tuned using a random grid search over 10 models for non-augmented data and 24 for augmented data. See Supplementary Table 2 for full parameter grid settings.
Evaluation metrics
Classification models were evaluated using positive predictive value (precision), sensitivity (recall), F1 score (i.e., the harmonic mean of precision and recall), and AUROC. Due to the extreme class imbalance of this task, we also report discounted cumulative gain (DCG) to capture the overall ranking quality of model predictions52. Each BAV or TAV case was assigned a relevance weight r of 1 or 0, respectively, and test set patients were ranked by their predicted probabilities. DCG is computed as \(\mathop {\sum}\nolimits_{i = 1}^p {\frac{{r_i}}{{{\mathrm{log}}_2(i + 1)}}}\) where p is the total number of instances and i is the corresponding rank. This score is normalized by the DCG score of a perfect ranking (i.e., all true BAV cases in the top ranked results) to compute normalized DCG (NDCG) in the range [0.0, 1.0]. Higher NDCG scores indicate that the model does a better job of ranking BAV cases higher than TAV cases. All scores were computed using test set data, using the best performing models found during grid search, and reported as the mean and 95% confidence intervals of 5 different random model weight initializations.
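The NDCG computation described above can be written directly from the formula. The function name `ndcg` is illustrative, and ties in predicted probability are broken by input order, which is an assumption.

```python
import numpy as np

def ndcg(relevance, scores):
    """Normalized DCG for binary relevance (1 = BAV, 0 = TAV)."""
    relevance = np.asarray(relevance, dtype=float)
    # Rank patients from highest to lowest predicted probability.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    # Discount 1 / log2(i + 1) for ranks i = 1, ..., p.
    discounts = 1.0 / np.log2(np.arange(1, len(relevance) + 1) + 1)
    dcg = float((relevance[order] * discounts).sum())
    ideal = float((np.sort(relevance)[::-1] * discounts).sum())
    return dcg / ideal if ideal > 0 else 0.0
```

For example, a single BAV case ranked third out of three contributes 1/log2(4) = 0.5, so the NDCG is 0.5, whereas ranking it first gives 1.0.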
For labeling functions, we report two additional metrics: coverage (Eq. (5)), the fraction of data points on which a labeling function casts a non-abstaining vote in {−1, 1}; and conflict (Eq. (6)), the fraction of data points where a labeling function disagrees with one or more other labeling functions.
$${\mathrm{coverage}}_{\lambda _j} = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {1}\left( {\lambda _j(x_i) \in \{ - 1,1\} } \right)$$
(5)
$${\mathrm{conflict}}_{\lambda _j} = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N {1}\left( {\mathop {\sum }\limits_{k \ne j} {1}\left( {\lambda _j(x_i) \in \{ - 1,1\} \wedge \lambda _j(x_i) \ne \lambda _k(x_i)} \right) > 0} \right)$$
(6)
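Both metrics can be computed column-wise from the label matrix. The sketch below follows Eqs. (5) and (6) literally, so an abstention by another labeling function counts as a disagreement. The optional `ignore_abstains` flag, an addition for illustration, gives the stricter variant in which only non-abstaining votes from other labeling functions can conflict.

```python
import numpy as np

def coverage(label_matrix):
    """Fraction of data points on which each labeling function votes."""
    return (np.asarray(label_matrix) != 0).mean(axis=0)

def conflict(label_matrix, ignore_abstains=False):
    """Fraction of data points where each labeling function votes and
    differs from at least one other labeling function."""
    L = np.asarray(label_matrix)
    out = np.zeros(L.shape[1])
    for j in range(L.shape[1]):
        others = np.delete(L, j, axis=1)
        differs = others != L[:, [j]]
        if ignore_abstains:
            differs &= others != 0
        voted = L[:, j] != 0
        out[j] = (voted & differs.any(axis=1)).mean()
    return out
```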
Model evaluation with clinical outcomes data
To construct a real-world validation strategy dependent upon the accuracy of image classification but completely independent of the imaging data input, we used model-derived classifications (TAV vs. BAV) as a predictor of validated cardiovascular outcomes using standard epidemiological methods. For 9230 patients with prospectively obtained MRI imaging who were excluded from any aspect of model construction, validation, or testing we performed an ensemble classification with the best performing CNN-LSTM model.
For evaluation we assembled a standard composite outcome of major adverse cardiovascular events (MACE). Phenotypes for MACE were inclusive of the first occurrence of coronary artery disease (myocardial infarction, percutaneous coronary intervention, coronary artery bypass grafting), ischemic stroke (inclusive of transient ischemic attack), heart failure, or atrial fibrillation. These were defined using ICD-9, ICD-10, and OPCS-4 codes from available hospital encounter, death registry, and self-reported survey data of all 500,000 participants of the UK Biobank at enrollment similar to previously reported methods53.
Counting from 10 years prior to enrollment in the study, the median follow-up time for the participants with MRI data included in the analysis was 19 years, with a maximum of 22 years. For survival analysis, we employed the “survival” and “survminer” packages in R version 3.4.3, using aortic valve classification as the predictor and time-to-MACE as the outcome, with model evaluation by a simple log-rank test.
To verify the accuracy of the CNN-LSTM’s predicted labels, we generated 2 subsets of our model’s predictions for manual review: (1) 36 randomly chosen MRI sequences (18 TAV and 18 BAV patients); and (2) 100 positive BAV predictions, binned into quartiles by predicted probability. All MRIs were reviewed and labeled by a single annotator (J.R.P.). The output of the last hidden layer was visualized using a t-distributed stochastic neighbor embedding (t-SNE)54 plot to assist error analysis.
Related work
In medical imaging, weak supervision refers to a broad range of techniques using limited, indirect, or noisy labels. Multiple instance learning (MIL) is one common weak supervision approach in medical images55. MIL approaches assume a label is defined over a bag of unlabeled instances, such as an image-level label being used to supervise a segmentation task. Xu et al.56 simultaneously performed binary classification and segmentation for histopathology images using a variant of MIL, where image-level labels are used to supervise both image classification and a segmentation subtask. ChestX-ray830 was used in Li et al.57 to jointly perform classification and localization using a small number of weakly labeled examples. Patient radiology reports and other medical record data are frequently used to generate noisy labels for imaging tasks30,58,59,60.
Weak supervision shares similarities with semi-supervised learning61, which enables training models using a small labeled dataset combined with large, unlabeled data. The primary difference is how the structure of unlabeled data is specified in the model. In semi-supervised learning, we make smoothness assumptions and extract insights on structure directly from unlabeled data using task-agnostic properties such as distance metrics and entropy constraints62. Weak supervision, in contrast, relies on directly injecting domain knowledge into the model to incorporate the underlying structure of unlabeled data. In many cases, these sources of domain knowledge are readily available in existing knowledge bases, indirectly-labeled data like patient notes, or weak classification models and heuristics.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.