1. INTRODUCTION
DCE‐MRI is currently the method of choice to measure the permeability of the blood–brain barrier (BBB). 1 Numerous neurological conditions have been investigated, including brain tumors, 2 multiple sclerosis, 3 traumatic brain injury, 4 stroke, 5 and Alzheimer's disease pathology. 6 Quantitative physiological maps of BBB permeability (Ve and Ktrans) and plasma volume (Vp) can be estimated from DCE‐MRI data using pharmacokinetic (PK) modeling. Arterial input function (AIF) estimation is necessary in PK modeling for accurate estimation of BBB permeability. Typically, large vessels such as the internal carotid artery (ICA), middle cerebral artery (MCA), or superior sagittal sinus (SSS) are used as a global AIF measurement 7 to minimize potential partial volume effects. Selecting a proper AIF is essential for accurate PK modeling, and the lack of consistent AIF selection introduces variation across studies, imaging sites, and analysis software. These variations are barriers to applying DCE‐MRI in clinical practice; indeed, the lack of standardized AIF usage introduces biases when comparing quantitative metrics across studies. AIF selection is often done manually, which is not only time consuming but also prone to inter‐operator variability, highlighting the need for automated approaches to improve accuracy and consistency.
Early work in automatic AIF detection on DCE‐MRI data was done by Rijpkema et al., 8 who developed an algorithm using relative gadolinium concentrations to select vascular pixels and bolus arrival time to identify arterial pixels, and Shi et al., 9 who developed a method using an accelerated version of affinity propagation clustering for rat kidneys and human brains. More recent work has applied machine learning and neural networks to AIF detection, achieving good results on dynamic susceptibility contrast (DSC) MRI, CT, and PET data. A multi‐stream convolutional neural network (CNN) was used by Fan et al. 10 on DSC‐MRI perfusion images to automatically estimate the AIF. AIFNet, an end‐to‐end CNN model, was built by de la Rosa et al. 11 to select the vascular function for CT perfusion image analysis. Jiang et al. 12 utilized a deep learning approach that included both spatial and temporal information to estimate the metabolite corrected input function from dynamic PET brain images. A recent study by Moradi et al. 13 used wavelet‐based methods and unsupervised machine learning to detect AIF in PET brain images.
We are only aware of a single study that has applied these newer techniques to DCE‐MRI data. A 3D UNet deep learning model was utilized by Loos et al., 14 which captured both spatial and temporal information to extract the AIF region in the DCE‐MRI. They were able to achieve good quality AIFs on a homogeneous single center dataset, but did not analyze what effect this had on the Ktrans maps and had a relatively small training set (n = 100).
In this study, we sought to improve the architecture of the previously proposed model 14 to make it faster and more robust. DCE‐MRI images for this study were curated from different institutions, scanners, and acquisition schemes to make the trained model more generalizable by training and testing on a larger and more diverse cohort. We additionally propose a new metric, the 'AIFitness' score, which measures the quality of an AIF curve and can be used for reliable quality control of both automatically and manually selected AIFs.
2. METHODS
2.1. Participants
Participant data were curated from five sites: the University of Southern California (USC), Washington University in St. Louis (WashU), Banner Alzheimer's Institute Phoenix, Loma Linda University, and publicly released data from MD Anderson Cancer Center at University of Texas. 15 Details on participants from Loma Linda and MD Anderson are provided in their respective papers, as these are external datasets. The study and procedures were approved by the Institutional Review Boards of USC ADRC, Washington University Knight ADRC, Banner Good Samaritan Medical Center, and Loma Linda University. Participants from USC, WashU, and Banner were included if they were between 45 and 89 years old, living independently, and had neuropsychological confirmation of no cognitive impairment or early cognitive dysfunction. Study exclusions included a history of cortical stroke, dementia, moderate‐to‐severe traumatic brain injury, major psychiatric or neurological illness, organ failure or major systemic illness, or active medications or other medical conditions that could impact cognitive function, as well as contraindications for contrast‐enhanced brain MRI. Participants were not excluded based on the presence of stable, controlled cardiovascular risk factors, such as hypertension, hyperlipidemia, coronary artery disease, or type 2 diabetes. The number of participants from each site and demographic information (age and sex) are reported in Table 1. The NACC UDS participant IDs can be made available on request for USC, WashU, and Banner.
TABLE 1.
Demographic characteristics of participants grouped by site and data split; mean and SD are reported for age variables.
| | Train | | | Validation | | | Test | | | Replication | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Split/Site | Age | Sex (M/F) | DCE‐MRIs | Age | Sex (M/F) | DCE‐MRIs | Age | Sex (M/F) | DCE‐MRIs | Age | Sex (M/F) | DCE‐MRIs |
| USC | 67.2 (9.2) | 53/74 | 173 | 59.8 (8.3) | 6/10 | 19 | 72.9 (8.9) | 8/8 | 18 | 69.4 (10.4) | 57/81 | 197 |
| LLU | 76.1 (8.1) | 13/18 | 31 | 72.3 (12.5) | 2/2 | 4 | 81.5 (7.0) | 3/1 | 4 | ‐ | ‐ | 0 |
| WashU | 67.0 (10.4) | 18/14 | 32 | 69.3 (4.4) | 1/3 | 4 | 70.0 (7.5) | 0/4 | 4 | 68.7 (8.0) | 65/72 | 137 |
| Banner | 64.9 (9.3) | 5/23 | 38 | 65.8 (8.9) | 0/4 | 4 | 69.0 (7.9) | 1/3 | 8 | 67.3 (8.2) | 17/34 | 87 |
| MD Anderson | 55.7 (6.7) | 11/1 | 36 | 53.0 (0) | 1/0 | 3 | 61.0 (1.4) | 2/0 | 6 | ‐ | ‐ | 0 |
| Total | 67.5 (9.2) | 100/130 | 310 | 60.9 (11.4) | 10/19 | 34 | 72.3 (9.0) | 14/16 | 40 | 68.8 (9.1) | 139/187 | 421 |
2.2. DCE‐MRI acquisition
All MRI data were acquired at 3T. At USC, MRI scans were acquired at the Mark and Mary Stevens Neuroimaging and Informatics Institute of USC, with a subset using a GE Signa HDxt with an eight‐channel head coil and the remainder using a Siemens Prisma with a 32‐channel head coil or a 20‐channel head/neck coil. At WashU, MRI scans were acquired using a Siemens Vida with a 64‐channel or a 20‐channel head coil. At Banner, MRI scans were acquired using a GE Discovery MR750 with an eight‐channel coil. The DCE‐MRI protocol from USC, WashU, and Banner consisted of T1 mapping and a dynamic scan angled to and covering the temporal lobe. The variable flip angle (VFA) method was used for T1 mapping with the following parameters: T1‐weighted 3D volumetric interpolated breath‐hold (VIBE), TR/TE = 4.15–8.34/1.53–3.168 ms, NEX = 1, up to five flip angles (2°, 5°, 10°, 12°, and 15°), FOV = 175 × 175 mm, matrix size 256 × 256, 240 × 320, or 320 × 320, and 12–16 slices. The same geometry was applied to DCE‐MRI: T1w 3D VIBE sequence, TR/TE = 5.11–8.4/3 ms, NEX = 1, FA = 15°, slice thickness = 5 mm with no gap, FOV = 175 × 175 mm, matrix size matching VFA, 12–16 slices, and total acquisition time 8.4–16.6 min with a time resolution of 7.9–15.4 s. After a short baseline period (~30–45 s, two to four measurements), gadolinium contrast agent was administered intravenously into the antecubital vein at a rate of 3 mL/s, followed by a 25‐mL saline flush. USC used Multihance (Bracco Diagnostics Inc., Italy, 0.05 mmol/kg) and Dotarem (Guerbet, France, 0.05 mmol/kg), WashU and Banner used Dotarem (0.05 mmol/kg), LLU used Multihance (0.1 mmol/kg), and MD Anderson used Magnevist (Bayer AG, Germany, 0.1 mmol/kg). The DCE‐MRI protocols for LLU and MD Anderson were similar, except that the MD Anderson data were acquired axially; exact acquisition details are provided in their respective papers. 15 , 16 A summary of acquisition parameters for all protocols is given in Table S1.
2.3. Image preprocessing
All DICOM DCE‐MRI images were converted to float32 NIfTI. Motion correction was applied to the dynamic series using FSL‐MCFLIRT, 17 and the second repetition of the dynamic series was chosen as the reference image. The signal intensities were normalized between 0 and 1, and then all images and masks were resampled to a matrix size of 256 × 256 × 32 × 32 (x, y, z, time). The masks were applied to the images to obtain an array of AIF values. Finally, the image‐AIF pairs were stored in TFRecords optimized for TensorFlow's input pipeline. No datasets were excluded due to image quality.
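The intensity normalization and temporal resampling steps above can be sketched per voxel time curve as follows (an illustrative sketch in pure Python; the study pipeline operated on full 4D volumes with standard resampling tools, and the use of linear interpolation here is an assumption):

```python
def normalize_01(values):
    """Scale signal intensities to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def resample_curve(curve, n_out=32):
    """Linearly resample a time curve to n_out points (the model
    expects 32 time points regardless of the acquired number)."""
    n_in = len(curve)
    if n_in == n_out:
        return list(curve)
    out = []
    for j in range(n_out):
        # fractional position of output sample j on the input grid
        x = j * (n_in - 1) / (n_out - 1)
        i = int(x)
        frac = x - i
        if i + 1 < n_in:
            out.append(curve[i] * (1 - frac) + curve[i + 1] * frac)
        else:
            out.append(curve[-1])
    return out
```

The same normalization is applied volume-wise in practice; resampling endpoints are preserved exactly, so bolus arrival and washout are not shifted in time.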
2.4. Manual AIF selection
The manual selection of the AIF was conducted by a single scientist (A.C.) with over 14 years of experience in biomedical imaging. The internal carotid artery near the petrous segment was selected as the primary target. The jugular vein was targeted for the axial data from MD Anderson. A circular region of interest (ROI) consisting of about 10 voxels was created using ImageJ, large enough to capture the most representative temporal evolution of the dynamic signal. The size of the ROI was carefully chosen to match the artery's dimensions while ensuring it did not contact the vessel wall. A similar circular ROI was drawn in the jugular vein in 41 randomly selected cases for a direct comparison of AIFs with a venous input function (VIF).
2.5. Data split
The DCE‐MRI dataset used to train the model consisted of 384 datasets (from 289 participants) collected from five different institutions. Three of these sources included longitudinal data (USC, Banner, and MD Anderson). We randomly divided our sample by participant and site into training, validation, and testing groups using an 80–10–10 percent split. Splitting by participant ensures that the model cannot gain an advantage by seeing the same participant across training, validation, and testing. All imaging sites were split separately to ensure a proportional representation of each site in the training, validation, and testing sets. Although the split was determined by participant rather than by dataset, the final dataset splits were approximately 80–10–10 percent: of the 384 datasets, 310 were used for training, 34 for validation, and 40 for testing.
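The participant-level, per-site split can be sketched as follows (illustrative only; the function and variable names are our own, and the exact randomization used in the study may differ):

```python
import random

def split_by_participant(site_to_participants, fracs=(0.8, 0.1, 0.1), seed=42):
    """Split participants into train/val/test within each site, so that
    no participant (and none of their longitudinal scans) spans two
    splits, and each site is proportionally represented."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for site, participants in site_to_participants.items():
        ids = sorted(participants)   # sort first for reproducibility
        rng.shuffle(ids)
        n = len(ids)
        n_train = round(fracs[0] * n)
        n_val = round(fracs[1] * n)
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits
```

Because the split is by participant, the resulting dataset-level proportions only approximate 80–10–10, as noted above.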
For replication, we used a naive cohort (datasets that do not have a manually drawn AIF) consisting of 326 participants and 421 datasets from three of the five sites and five of the nine protocols (for acquisition parameters see Table S1, USC Siemens Short n = 7, USC Siemens Normal n = 135, USC Siemens Mix n = 55, WashU n = 137, Banner n = 87). There was no overlap of participants between this cohort and the training, validation, and testing cohorts. Table 1 provides detailed information about our data and its split into training, validation, testing, and replication samples.
2.6. Model architecture
A 3D UNet model was previously developed 14 to segment the AIF region in human brain DCE‐MR images. However, model performance declined on our dataset, which consists of DCE‐MRI from multiple sources (USC, WashU, Banner, LLU, and MD Anderson), likely because of differences between our dataset and the training data used for their model. Therefore, we sought to improve the model of Loos et al. by making significant changes to the architecture and training of the 3D UNet and by training on a much larger, more heterogeneous dataset.
We have used a UNet 18 similar to Loos et al. 14 as the underlying architecture for our model, due to its robust performance in biomedical image segmentation with limited training data. 19 , 20 The model comprises a contracting path (encoder) and an expansive path (decoder). The encoder primarily uses convolutional and max pooling layers to downsample the input image and extract features. The decoder mainly uses upsampling and convolutional layers along with skip connections, resulting in precise pixel‐level segmentation. A 3D version of UNet is used to capture the volumetric information of the DCE‐MRI images properly.
The complete model architecture and our pipeline are shown in Figure 1. Input data are passed through the three levels of our UNet. The encoder at each level has a combination of convolutional, leaky rectified linear unit, and instance normalization layers followed by a max pooling layer. The decoder levels have a similar combination but also include concatenation layers and are instead followed by an up‐sampling layer. The number of convolution filters doubles after every level of our UNet except the bottleneck, whose convolutional layers have 320 filters, which we found improved results. The resolution of data used in this study was 0.5–1 mm × 0.5–1 mm × 4–5 mm; to keep the kernel size consistent in millimeters, we used kernels that were smaller in the z direction. Through experimentation, we found that a kernel size of 11 × 11 × 3 (x, y, z) for the initial level of our encoder and decoder and 7 × 7 × 3 for the remaining 3D convolutional layers in our UNet gave the best results. Finally, custom layers after the decoder predict the final result: a mask, overlaid on the corresponding slices of the input DCE‐MRI, that captures the AIF curve.
FIGURE 1.

Detailed architecture of the 3D UNet model for predicting AIF region. The model takes multi‐slice DCE‐MRI images as input, and predicts AIF masks and corresponding curves. The UNet consists of 3 levels of encoder and decoder with bottleneck layers. The colored arrows represent the different operations being performed at each step. The blue boxes denote feature maps, and the gray boxes represent the copied feature maps that are concatenated at each level with the blue boxes. The nominal image dimensions and number of channels of each feature map are shown for each level.
2.7. Model training and evaluation
One major modification in our approach was to design our loss function to use only temporal constraints, as opposed to the previous approach, 14 which used both spatial and temporal constraints. This is because there can be many voxels in several major blood vessels that give near‐identical AIF curves; for example, the left and right carotid arteries will give near‐identical curves in most subjects. Adding a spatial component to the loss function needlessly constrained the model and reduced performance. In our study, we employed a modified version of the Huber loss 21 as the loss function for the predicted AIF curve. The first 10 points of the resampled AIF were weighted 3:1 compared with the last 22 points when calculating the Huber loss. This was done to ensure that enough weight was given to the first pass of the contrast bolus, which is generally captured in only a few measurements. Lastly, the loss value was multiplied by a scaling factor of 200 to match the λ used by Loos. 14 The main advantage of the Huber loss is its robustness to outliers, as it combines the best properties of the commonly used mean squared error (MSE) and mean absolute error (MAE). 22 The equation below mathematically represents the modified Huber loss that we used, where $y_i$ represents the true (manual) curve and $\hat{y}_i$ represents the predicted curve:

$$\mathcal{L}(y,\hat{y}) = 200 \cdot \frac{\sum_{i=1}^{32} w_i\, H_\delta(y_i - \hat{y}_i)}{\sum_{i=1}^{32} w_i}, \qquad w_i = \begin{cases} 3, & i \le 10 \\ 1, & i > 10, \end{cases} \qquad H_\delta(a) = \begin{cases} \tfrac{1}{2}a^2, & |a| \le \delta \\ \delta\!\left(|a| - \tfrac{1}{2}\delta\right), & |a| > \delta. \end{cases}$$
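A minimal sketch of this weighted loss follows (pure Python for clarity; the study implementation used Keras/TensorFlow, and the Huber threshold `delta` and the use of a weighted mean rather than a weighted sum are assumptions):

```python
def huber(a, delta=1.0):
    """Standard Huber function: quadratic near zero, linear in the tails."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)

def weighted_huber_loss(y_true, y_pred, n_early=10, early_weight=3.0,
                        delta=1.0, scale=200.0):
    """Huber loss over a 32-point AIF curve, weighting the first n_early
    points (the first pass of the bolus) 3:1 versus the remaining
    points, then scaled by 200 as described in the text."""
    total, weight_sum = 0.0, 0.0
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        w = early_weight if i < n_early else 1.0
        total += w * huber(t - p, delta)
        weight_sum += w
    return scale * total / weight_sum
```

The 3:1 weighting means an error on a bolus-peak point costs three times as much as the same error in the washout tail.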
We used the Adam optimizer 23 with a learning rate of 10⁻³. We set the model to train for 200 epochs with a batch size of 1, but training ended at the 90th epoch due to an early stopping strategy (40 epochs without a new lowest validation loss) used to prevent overfitting. Our code for the model was written in Python 3 using Keras 2.12 and TensorFlow 2.12. An NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM was used for training and testing the model.
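The early-stopping behavior described above can be sketched as follows (a minimal illustration; the actual training used Keras callbacks, and `step_fn` is a hypothetical stand-in for one training epoch returning the validation loss):

```python
def train_with_early_stopping(step_fn, max_epochs=200, patience=40):
    """Run up to max_epochs epochs, stopping once `patience` epochs
    pass without a new lowest validation loss. Returns the best
    validation loss and the epoch at which training stopped."""
    best, best_epoch = float("inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        val_loss = step_fn(epoch)
        if val_loss < best:
            best, best_epoch = val_loss, epoch  # new best: reset patience
        elif epoch - best_epoch >= patience:
            break                               # patience exhausted
    return best, epoch
```

With a validation loss that stops improving after epoch 50, this scheme halts at epoch 90, mirroring the behavior reported above.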
2.8. AIFitness score
We developed a custom metric of AIF quality, AIFitness score, that we used to evaluate the quality of any AIF and to gauge the performance of our 3D UNet model. This metric was a combination of four factors (equations are described below and in Figure 2). The alpha weight of each score is adjusted to yield an average score of 100.
FIGURE 2.

AIFitness score captures four parameters of an AIF curve (A): average enhancement (B), bolus arrival time (C), washout (E), and peak height (F). Parameters were adjusted to yield a zero score at end/mean = 1.1, peaktime/endtime = 1.0, and baseline/mean = 1.0, and a score of 0.95 at peak/mean = 3.0. The alpha weight of each score is adjusted to yield an average AIFitness score of 100. AIFitness is a weighted average of all scores with weights (α) of 0.3, 0.3, 0.3, and 0.1, respectively. A comparison graph (D) shows examples of various AIF curves and AIFitness scores. Higher scores correspond to higher peaks, faster washout, and greater total enhancement.
We identified four characteristics of AIF curves that differentiate them from other tissue types: (1) during the first passage of the bolus, the contrast agent concentration in the blood will be higher than in any other tissue; (2) after the passage of the bolus, the concentration in the blood will decrease faster than in other tissues (washout); (3) the bolus will arrive in the blood earlier than in other tissues; and (4) the mean concentration in the blood over the entire measured time will be higher than in other tissues. Scores to quantify each of these characteristics were created that fulfill two criteria: (1) have a maximum value of 1, so a single good score cannot outweigh one or more bad scores, and (2) be scale invariant, so arbitrary scaling of signal intensity cannot affect the score.
The peak‐to‐mean score is a sigmoid function that returns a higher score for a high peak value during the bolus passage of the contrast agent. For venous contrast injections, the blood concentration should be higher than that of any other tissue during the first pass of the bolus. The function has a maximum value of 1 and reaches 0.95 at a ratio (peak/mean) of 3; this prevents noisy areas with a single high point from skewing the score too high.
The end‐to‐mean score is an inverted parabola with a maximum score of 1 when end/mean is zero and a negative value when end/mean > 1.1. This selects for a faster washout with a lower concentration at the end of the AIF, and penalizes, with a negative score, curves whose end value is above the mean. For a standard bolus injection, AIF curves should consistently drop after the first passage of the bolus as contrast is transported out of the blood into other tissues.
The arrival‐time score is a simple linear function with a maximum value of 1 when the curve maximum occurs at the first time point after baseline, decreasing to 0 when the maximum occurs at the last time point. This score selects for an early bolus arrival time, as the arteries should reach their maximum value before any other tissue.
The baseline‐to‐mean score is an inverted parabola with a maximum score of 1 when baseline/mean is zero and negative values when baseline/mean > 1. This selects for a higher average concentration of contrast agent, as the blood should have the highest average concentration for most scans. It also penalizes any voxel with a mean value less than baseline, as the average value should never drop below baseline for an AIF (this would imply a negative amount of contrast agent, a bad baseline measurement, or high noise, all indications of a poor AIF). The assumption that blood will have the highest mean concentration may be violated for very long scans (>30 min) as the contrast agent accumulates in the bladder. However, the bladder signal would not score highly on the other metrics (washout and arrival time) and would not produce a high overall AIFitness score.
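The four component scores can be combined as sketched below. The functional forms follow the shapes described above (sigmoid, inverted parabolas, linear), but the exact parameterizations here are assumptions fitted to the stated anchor points (score 0.95 at peak/mean = 3, zero crossings at end/mean = 1.1 and baseline/mean = 1.0):

```python
import math

def aifitness(curve, n_baseline=3, alpha=100.0, weights=(0.3, 0.3, 0.3, 0.1)):
    """Illustrative AIFitness composite for a single time curve.
    Weights follow the 0.3/0.3/0.3/0.1 values given in Figure 2;
    n_baseline (number of pre-bolus points) and the per-score
    parameterizations are assumptions."""
    mean = sum(curve) / len(curve)
    baseline = sum(curve[:n_baseline]) / n_baseline
    peak_idx = max(range(len(curve)), key=lambda i: curve[i])

    # Peak-to-mean: sigmoid, max 1, reaching 0.95 at a ratio of 3
    k = math.log(19.0) / 1.5  # solves 1 / (1 + e^(-1.5 k)) = 0.95
    peak_score = 1.0 / (1.0 + math.exp(-k * (curve[peak_idx] / mean - 1.5)))
    # End-to-mean: inverted parabola, 1 at 0, negative above 1.1
    end_score = 1.0 - (curve[-1] / mean / 1.1) ** 2
    # Arrival time: linear, 1 just after baseline, 0 at the last point
    t0, T = n_baseline, len(curve) - 1
    arrival_score = max(0.0, 1.0 - max(0, peak_idx - t0) / (T - t0))
    # Baseline-to-mean: inverted parabola, 1 at 0, negative above 1.0
    base_score = 1.0 - (baseline / mean) ** 2

    scores = (peak_score, end_score, arrival_score, base_score)
    return alpha * sum(w * s for w, s in zip(weights, scores))
```

Each component is capped at 1 and built from intensity ratios, satisfying the two stated criteria (boundedness and scale invariance).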
2.9. Hyper‐parameter tuning and other experiments
We conducted various experiments to select, improve, and optimize our model. The first experiments determined the final architecture of our model. We experimented with multiple variations of UNet: a plain 3D UNet, Attention 3D UNet, 24 3D UNet with single‐head attention, and 3D UNet with a Squeeze‐and‐Excitation block. 25 The most promising were the simple 3D UNet and the 3D UNet with modified single‐head attention, and these were used in subsequent analysis. The single‐head attention mechanism allows each pixel in the spatial feature map to attend to every other pixel, which helps capture long‐range dependencies and relationships between pixels.
The other set of experiments dealt with finding the most suitable loss function, optimizer, and learning rate for our model. Initially, for the loss function, we tried spatial metrics such as the Dice coefficient and Jaccard index, as well as the one developed by Loos. 14 Then, we tried a weighted combination of spatial and temporal/regression constraints. Lastly, we tried purely temporal loss functions such as MAE and MSE, modified in the same way as the Huber loss that was ultimately used.
Of the top three models ranked by mean AIFitness, one had a single‐head attention mechanism 26 in the bottleneck of the 3D UNet, and another used a modified version of this single‐head attention for improved results. Both attention models also applied a learning‐rate strategy for better model convergence.
2.10. DCE‐MRI processing
We used ROCKETSHIP 27 software for the pharmacokinetic modeling of DCE‐MRI images to create multi‐slice Ktrans maps. ROCKETSHIP is an open‐source software package providing tools for the creation of parametric maps of tissue perfusion, permeability, and related hemodynamic properties (Ve, Vp, Ktrans, etc.). The software supports various models and methods, including single‐ and multi‐compartmental tracer kinetic models such as Patlak, Tofts, Ex‐Tofts, 2CXM, FXR, and steady‐state. 27 , 28 In this study, Ktrans maps were generated for the automatically and manually generated AIF curves. In brief, preprocessing was applied to the variable flip angle and DCE data, including bias field correction (FSL FAST), z‐slice normalization (in‐house script), brain extraction (HD‐BET), and motion correction (FSL MCFLIRT). All acquired images were co‐registered using ANTs. ROCKETSHIP was then used to generate the T1 maps, and the Ktrans maps using the Patlak pharmacokinetic model. Tissue segmentation was performed on the T1‐MPRAGE images using FSL‐FAST, and these masks were applied to the Ktrans maps to extract quantitative values.
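The Patlak model fits the tissue concentration as Ct(t) = Ktrans · ∫Cp dτ + vp · Cp(t), which is linear in the two unknowns. A minimal least-squares sketch follows (pure Python with hand-rolled 2 × 2 normal equations; ROCKETSHIP's actual implementation differs, and the function names are our own):

```python
def cumtrapz(y, t):
    """Cumulative trapezoidal integral of y(t); same length as y."""
    out = [0.0]
    for i in range(1, len(y)):
        out.append(out[-1] + 0.5 * (y[i] + y[i - 1]) * (t[i] - t[i - 1]))
    return out

def patlak_fit(ct, cp, t):
    """Least-squares fit of Ct(t) = Ktrans * integral(Cp) + vp * Cp(t)
    for one voxel; returns (Ktrans, vp)."""
    integ = cumtrapz(cp, t)
    # Normal equations for the two regressors (integ, cp)
    a11 = sum(x * x for x in integ)
    a12 = sum(x * y for x, y in zip(integ, cp))
    a22 = sum(y * y for y in cp)
    b1 = sum(x * z for x, z in zip(integ, ct))
    b2 = sum(y * z for y, z in zip(cp, ct))
    det = a11 * a22 - a12 * a12
    ktrans = (b1 * a22 - b2 * a12) / det
    vp = (a11 * b2 - a12 * b1) / det
    return ktrans, vp
```

Run voxel-wise over the dynamic series, with Cp taken from the AIF, this produces the Ktrans maps compared in the Results.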
2.11. Statistical analysis
All statistical testing was performed using Python 3.10.12. Values of p < 0.05 were considered significant. AIFitness values were compared using paired two‐sided t‐tests. To compare AIF curves, the millimolar concentration AIF curve for each subject was shifted to align max value across subjects, then each time point was compared using a two‐sided paired t‐test calculated with SciPy 1.14.1. No multi‐comparison correction was applied since all the values are highly correlated. Ktrans values were compared by taking the mean value for gray matter (GM) and white matter (WM) for each subject with whole brain coverage, and the mean cerebellum value for subjects with partial brain coverage. Spearman's correlation was calculated with SciPy 1.14.1, and intraclass correlation coefficient (ICC) type 2 calculated with Pingouin 0.5.5.
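The peak-alignment step used before the per-time-point comparison can be sketched as follows (an illustrative sketch; padding with edge values after shifting is an assumption, and the study aligned millimolar concentration curves):

```python
def align_to_peak(curves, target_idx=None):
    """Shift each curve so all maxima land on the same time index,
    padding with edge values, so time points can be compared
    pairwise across subjects."""
    peaks = [max(range(len(c)), key=lambda i: c[i]) for c in curves]
    if target_idx is None:
        target_idx = round(sum(peaks) / len(peaks))
    aligned = []
    for c, p in zip(curves, peaks):
        shift = target_idx - p
        if shift >= 0:
            # peak too early: pad the front with the first value
            shifted = [c[0]] * shift + c[:len(c) - shift]
        else:
            # peak too late: drop the front, pad the end
            shifted = c[-shift:] + [c[-1]] * (-shift)
        aligned.append(shifted)
    return aligned
```

After alignment, each time index corresponds across subjects and the paired per-time-point t-tests described above can be applied directly.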
3. RESULTS
Table 2 shows the performance of the test cohort of our best three models using the Adam optimizer and learning rate and loss function selected from the hyperparameter tuning. The architecture proposed by Loos generally performed very poorly on our data, while comparatively modest differences were found across our top models. The Huber loss model (Figure S1) was considered the best performing with the highest mean AIFitness score, with significantly lower SD compared to the other models.
TABLE 2.
AIFitness scores for manual and automatic models.
| Model | AIFitness | ||
|---|---|---|---|
| Mean | SD | Failures | |
| 3D UNet with additional CNN kernels (best) | 93.9 | 20.5 | 1 |
| 3D UNet with attention and learning rate strategy | 90.15 | 24.96 | 1 |
| 3D UNet with modified attention and learning rate strategy | 91.28 | 21.5 | 1 |
| Original Loos Model (minimal changes) | 45.02 | 13.57 | 6 |
| Manual | 99.7 | 24.4 | 0 |
Our final model achieved a mean AIFitness score of 93.9 (SD 20.5) on the test cohort, while the manually drawn masks had a mean score of 99.7 (SD 24.4); these were not significantly different (p = 0.06, Table 3). While the model generally performed well, there were a few cases where it failed to find a valid AIF. AIFs with AIFitness scores less than 54 were unable to provide diagnostic DCE data and were recorded as failures of the model and excluded from comparisons. An example of such a failure is included in Figure S2. All manually drawn AIFs were above the AIFitness failure threshold.
TABLE 3.
Performance of AIFitness score and Ktrans metric (min⁻¹ · 10⁻³) with manual selection and the best performing auto AIF model.
| Parameter | Manual AIF: Test | Auto AIF: Test | Auto AIF: Replication |
|---|---|---|---|
| Subject/DCE‐MRI | 30/40 | 30/40 | 326/421 |
| Failed cases | NA | 1 | 1 |
| AIFitness mean | 99.7 | 93.9 | 97.0 |
| AIFitness SD | 24.4 | 20.5 | 17.3 |
| AIFitness 5% | 67.2 | 67.1 | 70.1 |
| Ktrans GM median | 0.61 | 0.59 | 0.60 |
| Ktrans GM IQR | 0.61 | 0.70 | 0.40 |
| Ktrans WM median | 0.45 | 0.45 | 0.44 |
| Ktrans WM IQR | 0.70 | 0.57 | 0.30 |
AIF curves from the model were directly compared with manually selected AIF curves in the test cohort by aligning the bolus arrival time across all participants and then averaging all curves together (Figure 3A). The curves were extremely similar; the largest difference was at the peak time point, where the automatic AIF was lower (p = 0.015), and no other time points showed a significant difference. The inset brain images demonstrate spatial differences in voxel selection between the automatic and manual approaches. Manual selection tends to be more focal and symmetric, while automatic detection may show minor spatial differences. Importantly, the automatic detection localizes well to arteries even though no spatial weight was included in the loss function, which suggests strong generalizability and thus applicability to other anatomical regions.
FIGURE 3.

Comparison of AIF concentrations (A) and Ktrans values (B) using a manually selected AIF and the AIF predicted with the proposed model on the test cohort that was withheld from all training and validation. AIF concentration curves (A) were shifted to align max value across participants, mean and 95% confidence intervals are shown, inset shows location of auto and manual AIFs aligned to a common template, 15 mm thick slice shown. (B) Ktrans values are an average of whole brain gray matter, white matter. For datasets that did not cover the whole brain cerebellum values are used. Intraclass correlation of r = 0.89. (C) Paired Bland Altman plot of the test cohort showing manual AIF Ktrans means minus automatic AIF Ktrans means. Red lines represent 95% confidence intervals of the mean difference. (D) Violin plots of test cohort compared to replication cohort showing a similar distribution of Ktrans values.
Ktrans values from the test cohort were calculated using the automatically selected AIF and the manually selected AIF (Figure 3B). The Ktrans values were highly consistent, with a Spearman's correlation coefficient of r = 0.87 and an intraclass correlation coefficient of 0.89. No significant difference was found between automatic and manual AIF Ktrans values for the gray matter or the white matter with a paired Wilcoxon signed‐rank test. The Bland–Altman plot (Figure 3C) comparing the two methods shows a dense cluster of points near a difference of zero, suggesting reasonable agreement between automatic and manual methods. However, some variation exists, particularly at higher permeability values, resulting in a mean difference slightly biased toward the manual AIF.
The replication dataset, consisting of 421 datasets never seen during training, showed very similar results to the test cohort. The model achieved an even higher mean AIFitness score of 97.0 and a lower SD of 17.3, with a very similar distribution of Ktrans values (Figure 3D). The higher AIFitness score and lower Ktrans IQR are likely due to this dataset containing data from only three of the five sites.
We tested our approach in a glioblastoma patient to evaluate the UNet performance in a clinical setting with known BBB disruption (Figure 5). The auto generated AIF ROI, curve, and resulting Ktrans map revealed heterogeneous and markedly elevated permeability in contrast‐enhanced regions.
FIGURE 5.

Example of a glioblastoma case. T1‐weighted pre‐contrast axial slice with the corresponding ROI, automatically detected in the superior sagittal sinus. The corresponding AIF curve shape, AIFitness score, and multislice Ktrans map are displayed.
A direct comparison of manually drawn AIFs and VIFs in 41 randomly selected patients showed that VIFs yield an overall lower AIFitness score, a lower peak‐to‐mean signal intensity, and a slower washout of contrast (possibly indicating more tissue contamination), all indicating a lower AIF quality in this cohort (Figure S3).
4. DISCUSSION
A major advantage of automatically defining the AIF is consistency. As shown in Table 2, the proposed automatic AIF did produce an approximately 6% lower average AIFitness score, but it reduced the SD of AIFitness values by 16%. This reduced variability in AIF quality also appears to have reduced the variability of measured Ktrans values across a diverse multi‐center testing dataset compared to manually selected AIFs. While there could be many different reasons for this, we believe this represents a reduction in the noise from AIF selection. It has previously been shown that the use of automatic or semi‐automatic AIF selection leads to improvements in reproducibility 29 compared to manual AIF selection. This increased reproducibility suggests that using an automatic AIF removes variance from noise, and we believe the decreased variance observed here represents an improvement in data quality.
The AIF curves found by the proposed model were very similar to the manually identified ones. When looking at the average values over time, the point that showed the most difference was the bolus peak value. While the peak value is very important for calculating flow, many applications of DCE‐MRI do not calculate flow, and the peak value makes very little difference for calculating Ktrans values. A recent consensus paper even recommends excluding the peak value when calculating Ktrans as errors in the peak measurement degrade the quality of the fit. 30 For most DCE‐MRI applications that do not calculate flow, we do not think this is a significant concern, and the calculated Ktrans values presented here did show a very high correlation (r = 0.87).
The AIFitness quality score, presented here as a metric to compare AIF quality, also works very well as a quality control measure. Data with an AIFitness score below a threshold value can be flagged as low quality, and then either excluded from analysis or flagged for manual AIF selection if appropriate. This provides a robust and consistent way to monitor failures of the automatic AIF model. This can be a useful tool for highly automated processing of large datasets. In this study, we considered AIFitness scores below 54 as a model failure and excluded them from analysis. A few poor curves are shown in Figure 2D and Figure S2. Setting an appropriate exclusion threshold is highly dependent on the individual study and must be determined based on the goals of the study and the other QC processes utilized.
The AIFitness quality score works best as a relative measure to compare multiple AIFs from a single dataset, or to compare AIFs from a cohort with identical acquisition parameters. However, changes to the acquisition protocol can cause systematic shifts in AIFitness scores. For example, different acquisition durations will result in different end signal intensities, which will change AIFitness scores. Therefore, caution is warranted when comparing AIFitness scores across diverse studies, and any quality thresholding should be performed on a per‐acquisition‐protocol basis. A small verification study similar to Fritz‐Hansen et al. 31 would be useful to establish ground truth, but such invasive measurements are not practical in the large, diverse populations necessary for training a neural network.
Compared to prior studies, this study used a significantly larger multi‐center dataset with more variation in the data (see Table S1). This is likely the reason for the poor performance of the Loos model, even after retraining it on our data. While our new architecture handles this variety of data much better, Figure 4 shows that there are still significant differences in model performance on data from different imaging sites. Unsurprisingly, the model performs worst on the MD Anderson data, which is both very different from the other sites (data are acquired in the axial plane targeting the neck) and has a relatively small sample size (n = 45). Performance is much better and more consistent across the other imaging sites, which were all acquired in the coronal plane. However, the large and diverse training set does allow the model to generalize quite well, and it performed acceptably on a brain tumor case (Figure 5) despite these data being quite different from any of the training data.
FIGURE 4.

Five representative examples from each data center are shown. (A) Coronal DCE‐MRI images with manually and automatically identified AIF regions (zoom in). (B) Normalized AIF curves over time for each method. (C) Color‐coded Ktrans maps overlaid on structural images.
Compared to the prior model by Loos et al., a number of improvements were made beyond the improved results. The number of time points considered by the model was significantly increased, from 7 to 32. This makes fewer assumptions about the data (e.g., bolus arrival time) and should provide more robust results. The data pipeline was also reworked for much faster model training; our final model averaged approximately 750 s per epoch compared with 1200 s per epoch for the Loos model (on a single GeForce RTX 4090). Image resampling was incorporated directly into the program, so it can be run on images with various matrix sizes without the need for pre‐processing. All of these changes improve the usability of this model on a wide variety of data.
The relatively low number of subjects in the validation and testing groups was chosen to maximize the number in the training group, but this is a limitation to take into consideration. While we were able to assemble a large dataset for training (n = 384), a significant limitation is the relatively small number of axial datasets (n = 45). This resulted in performance on axial data that was lower than on coronal data, and future work should focus on expanding the training dataset to include more data in the axial and sagittal planes. Finally, expanding this pipeline to other organs, adding features to prefer or avoid certain areas of the images, and training the model to use phase data for AIF selection 32 are further possibilities to test with the newly developed model.