Deep learning predicts postsurgical recurrence of hepatocellular carcinoma from digital histopathologic images

Figure 1 shows the overall framework of HCC-SurvNet, the deep learning-based system for predicting a risk score for the recurrence-free interval (RFI). The system consists of two stages: tumor tile classification and risk score prediction.

Tumor tile classification

To develop a deep convolutional neural network (CNN) that automatically detects tumor-containing tiles within whole-slide images (WSI), we used the Stanford-HCCDET dataset (n = 128,222 tiles from 36 WSI). All tumor regions in each WSI in the Stanford-HCCDET dataset were manually annotated by the reference pathologist (J.S.). Each WSI was preprocessed and tiled into image patches. Using these ground truth labels and image tiles, we trained the CNN on 78% of the WSI (100,976 tiles from 28 WSI), validated it on 11% (15,834 tiles from 4 WSI), and tested it internally on the remaining 11% (11,412 tiles from 4 WSI), with no patient overlap among the three sets. The final optimized tumor versus non-tumor tile classifier was externally tested on 30 WSI (n = 82,532 tiles) randomly sampled from the TCGA-HCC dataset.
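A split with no patient overlap can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one slide per patient, and the function name `split_by_slide`, the seed, and the toy tile identifiers are all hypothetical. The key point is that whole slides, not individual tiles, are assigned to each set.

```python
import random
from collections import defaultdict


def split_by_slide(tile_to_slide, n_val, n_test, seed=0):
    """Assign whole slides (never individual tiles) to train/val/test.

    tile_to_slide maps a tile identifier to the slide it was cut from.
    Assumption: one slide per patient, so a slide-level split is also a
    patient-level split.
    """
    slides = sorted(set(tile_to_slide.values()))
    random.Random(seed).shuffle(slides)
    test_ids = set(slides[:n_test])
    val_ids = set(slides[n_test:n_test + n_val])

    splits = defaultdict(list)
    for tile, slide in tile_to_slide.items():
        if slide in test_ids:
            splits["test"].append(tile)
        elif slide in val_ids:
            splits["val"].append(tile)
        else:
            splits["train"].append(tile)
    return dict(splits)


# Toy data: 36 hypothetical slides with 5 tiles each (28/4/4 slide split).
tiles = {f"slide{s:02d}_tile{t}": f"slide{s:02d}"
         for s in range(36) for t in range(5)}
splits = split_by_slide(tiles, n_val=4, n_test=4)
slides_in = {name: {tiles[t] for t in members}
             for name, members in splits.items()}
```

Splitting at the tile level instead would let tiles from the same slide land in both training and test sets, which tends to inflate test performance.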

Among the tiles in the internal test set, 25.7% (2,932 of 11,412 tiles) were tumor positive, whereas 48.8% (40,288 of 82,532 tiles) were tumor positive in the external test set. The accuracies of tumor tile classification were 92.3% and 90.8% on the internal and external test sets, respectively. The areas under the receiver operating characteristic curve (AUROCs) were 0.952 (95% CI 0.948, 0.957) and 0.956 (95% CI 0.955, 0.958) for the internal and external test sets, respectively. Model outputs differed significantly between tiles with a ground truth of tumor versus non-tumor, on both the internal and external test sets (p < 0.0001 for both) (Figs. 2, 3).
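For readers less familiar with these metrics, the sketch below computes accuracy and AUROC for a toy set of tile predictions. It is illustrative only: the 0.5 decision threshold is an assumption (the text does not state the threshold used), the labels and scores are made up, and the AUROC is computed with the pairwise (Mann-Whitney) definition rather than any particular library.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of tiles classified correctly at the given threshold.

    The 0.5 threshold is an assumption made for this example.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def auroc(labels, scores):
    """AUROC as the probability that a random tumor tile receives a
    higher score than a random non-tumor tile (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = tumor tile, 0 = non-tumor tile.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of tiles, so it suits small illustrations; a rank-based implementation would be preferable at the scale of the test sets described above.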

Risk score prediction

Datasets

To develop a risk score prediction model, we used two datasets: the TCGA-HCC and Stanford-HCC datasets, originating from two independent data sources, the Cancer Genome Atlas (TCGA)-LIHC diagnostic slide collection and the Stanford Department of Pathology slide archive, respectively. The TCGA-HCC was further split into TCGA-HCC development and test datasets.

The TCGA-HCC development dataset (containing the training and validation sets) consisted of 299 patients (median age of 60 years, with an interquartile range (IQR) of 51–68 years, 69% male and 31% female). The frequencies of risk factors for HCC were: 32% for hepatitis B virus infection, 15% for hepatitis C virus infection, 34% for alcohol intake, and 4.9% for nonalcoholic fatty liver disease (NAFLD). The AJCC (8th edition) stage grouping was IA in 2.7%, IB in 41%, II in 29%, IIIA in 20%, IIIB in 5.4%, IVA in 1.0%, and IVB in 0.3% of the patients. One hundred and fifty-one patients experienced disease recurrence during follow-up (median follow-up time of 12.2 months) (Table 1).

Table 1 Patient characteristics for the Stanford-HCC and TCGA-HCC datasets.

The TCGA-HCC test dataset consisted of 53 patients (median age of 61 years, with an IQR of 51–68 years, 62% male and 38% female). The frequencies of risk factors for HCC were: 33% for hepatitis B virus infection, 16% for hepatitis C virus infection, 39% for alcohol intake, and 10% for NAFLD. The AJCC stage grouping was IA in 1.9%, IB in 46%, II in 31%, IIIA in 17%, IIIB in 1.9%, and IVB in 1.9% of the patients. Twenty-five patients experienced recurrence during follow-up (median follow-up time of 12.7 months) (Table 1). None of the clinicopathologic features were significantly associated with shorter RFI upon univariable Cox regression analysis, while a Batts–Ludwig²² fibrosis stage > 2 showed borderline significance (hazard ratio (HR) = 2.7 (95% confidence interval (CI) 0.98, 7.7), p = 0.0543) (Table 2).

Table 2 Univariable Cox proportional hazards analysis of the risk of recurrence.

The Stanford-HCC dataset consisted of 198 patients (median age of 64 years, with an IQR of 57–69 years, 79% male and 21% female). The frequencies of risk factors for HCC were: 26% for hepatitis B virus infection, 52% for hepatitis C virus infection, 8.6% for alcohol intake, and 7.1% for NAFLD. The overall AJCC stage grouping was IA in 22%, IB in 21%, II in 34%, IIIA in 5.6%, IIIB in 3.5%, and IVA in 1.0% of the patients. Sixty-two patients experienced disease recurrence during follow-up (median follow-up time of 24.9 months) (Table 1). On univariable Cox regression analysis, the clinical and pathologic features associated with shorter RFI were AJCC stage grouping > II [HR = 4.4 (95% CI 2.3, 8.3), p < 0.0001], greatest tumor diameter > 5 cm [HR = 3.5 (95% CI 2.1, 5.8), p < 0.0001], histologic grade > moderately differentiated [HR = 2.1 (95% CI 1.2, 3.9), p = 0.0128], presence of microvascular invasion [HR = 3.9 (95% CI 2.4, 6.5), p < 0.0001], presence of macrovascular invasion [HR = 5.3 (95% CI 2.1, 13), p < 0.0001], and positive surgical margin [HR = 6.8 (95% CI 1.6, 28), p = 0.009]. In contrast, fibrosis stage > 2 was associated with longer RFI [HR = 0.33 (95% CI 0.2, 0.55), p < 0.0001] (Table 2).

HCC-SurvNet performance for RFI prediction

The tumor tile classification model was applied to each tissue-containing image tile in the TCGA-HCC development (n = 299 WSI) and test (n = 53 WSI) datasets and the Stanford-HCC dataset (n = 198 WSI). From each WSI, the 100 tiles with the highest probabilities for the tumor class were selected for input into the subsequent risk score model. Figure 4 shows examples of tiles with probabilities in the top 100 for containing tumor, overlaid onto the original WSI. A MobileNetV2²³ pre-trained on ImageNet²⁴ was modified by replacing the fully-connected layers and fine-tuned by transfer learning, with on-the-fly data augmentation, on the tiles from the TCGA-HCC development dataset (n = 307 WSI from 299 patients). The model input was a 299 × 299 pixel image tile, and the output was a continuous tile-level risk score from the hazard function for RFI. The negative partial log-likelihood of the Cox proportional hazards model was used as the loss function¹⁴,¹⁵. The model’s performance was evaluated internally on the TCGA-HCC test dataset (n = 53 WSI from 53 patients), and externally on the Stanford-HCC dataset (n = 198 WSI from 198 patients). All tile-level risk scores from a patient were averaged to yield a patient-level risk score.
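The three computational steps in this paragraph (top-100 tile selection, the Cox negative partial log-likelihood, and patient-level averaging) can be sketched in plain Python. This is a simplified illustration, not the published implementation: the function names are hypothetical, the loss is averaged over observed events (an assumption), tied event times are handled in the simplest Breslow-style way, and a real training loop would use a differentiable tensor version of the same formula.

```python
import math


def top_k_tiles(tumor_probs, k=100):
    """Indices of the k tiles with the highest tumor-class probability."""
    order = sorted(range(len(tumor_probs)),
                   key=lambda i: tumor_probs[i], reverse=True)
    return order[:k]


def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over events.

    risk  : predicted log-hazard (risk score) per sample
    time  : follow-up time per sample
    event : 1 if recurrence was observed, 0 if censored
    Averaging over the number of events is an assumption of this sketch.
    """
    total, n_events = 0.0, 0
    for i in range(len(risk)):
        if event[i] != 1:
            continue  # censored samples only contribute via risk sets
        # Risk set: every sample still under observation at time[i].
        risk_set = [math.exp(risk[j]) for j in range(len(risk))
                    if time[j] >= time[i]]
        total += risk[i] - math.log(sum(risk_set))
        n_events += 1
    return -total / n_events if n_events else 0.0


def patient_risk(tile_scores):
    """Patient-level risk score as the mean of the tile-level scores."""
    return sum(tile_scores) / len(tile_scores)


# Toy check: scores that rank early recurrences highest give a lower loss.
times, events = [1.0, 2.0, 3.0], [1, 1, 1]
loss_good = cox_neg_partial_log_likelihood([2.0, 1.0, 0.0], times, events)
loss_bad = cox_neg_partial_log_likelihood([0.0, 1.0, 2.0], times, events)
```

The loss depends only on how the scores order patients within each risk set, which is why the output can be read as a relative risk score rather than an absolute recurrence time.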

We assessed HCC-SurvNet’s performance using Harrell’s²⁵ and Uno’s²⁶ concordance indices (c-indices). On the internal test set (TCGA-HCC test dataset, n = 53 patients), Harrell’s and Uno’s c-indices were 0.724 and 0.724, respectively. On the external test set (Stanford-HCC, n = 198 patients), the indices were 0.683 and 0.670, respectively. We observed statistically significant differences in the survival distributions between the low- and high-risk subgroups, as stratified by the risk scores predicted by HCC-SurvNet, on both the internal and external test sets (log-rank p value: 0.0013 and < 0.0001, respectively) (Figs. 5, 6).
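As a reminder of what the concordance index measures, here is a minimal sketch of Harrell's c-index. It is a simplified illustration that ignores tied event times, and the function name and toy data are hypothetical. Uno's c-index additionally reweights pairs by the inverse probability of censoring and is not implemented here.

```python
def harrell_c_index(risk, time, event):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when sample i had an observed event
    before sample j's follow-up time. The pair is concordant when the
    sample that recurred earlier also has the higher risk score.
    Ties in risk count as 0.5; tied times are ignored in this sketch.
    """
    concordant, comparable = 0.0, 0
    n = len(risk)
    for i in range(n):
        if event[i] != 1:
            continue
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy cohort: the second patient is censored at 10 months.
toy_time = [5, 10, 15, 20]
toy_event = [1, 0, 1, 1]
toy_risk = [0.9, 0.2, 0.1, 0.5]
toy_c = harrell_c_index(toy_risk, toy_time, toy_event)
```

In the toy cohort, four pairs are comparable and three are concordant, giving 0.75. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.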

Histograms of HCC-SurvNet’s risk scores, along with the threshold used for risk group stratification, are shown in Supplementary Fig. 1. On univariable Cox proportional hazards analysis, the HCC-SurvNet risk score was a predictor of the RFI, for both the internal [HR = 6.52 (95% CI 1.83, 23.2), p = 0.0038] and external [HR = 3.72 (95% CI 2.17, 6.37), p < 0.0001] test sets (Table 2). A continuous linear association between HCC-SurvNet’s risk score and the log relative hazard for RFI was observed by analysis of the internal and external test cohorts by univariable Cox proportional hazards regression with restricted cubic splines (Supplementary Fig. 2), validating the use of HCC-SurvNet’s risk score as a linear factor in the Cox analyses.
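The reported hazard ratios, confidence intervals, and p values can be related to one another with a short calculation. The sketch below assumes the intervals are Wald intervals on the log-hazard scale, which the text does not state. Under that assumption, the HR is the geometric mean of the interval bounds, and a two-sided p value follows from the implied coefficient and standard error. The function name is hypothetical.

```python
import math


def wald_summary(ci_lower, ci_upper, z_crit=1.959964):
    """Recover HR, log-hazard coefficient, standard error, and two-sided
    Wald p value from a 95% CI, assuming the CI is exp(beta +/- z * SE)."""
    log_lo, log_hi = math.log(ci_lower), math.log(ci_upper)
    beta = (log_lo + log_hi) / 2.0
    se = (log_hi - log_lo) / (2.0 * z_crit)
    z = beta / se
    # Two-sided normal tail probability via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(beta), beta, se, p_value


# Univariable intervals reported above for the HCC-SurvNet risk score.
hr_int, beta_int, se_int, p_int = wald_summary(1.83, 23.2)   # internal
hr_ext, beta_ext, se_ext, p_ext = wald_summary(2.17, 6.37)   # external
```

With these inputs, the recovered values agree with the reported HR = 6.52 (p = 0.0038) and HR = 3.72 (p < 0.0001) up to rounding, which is consistent with the Wald assumption but does not confirm it.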

On multivariable Cox proportional hazards analysis, HCC-SurvNet’s risk score was an independent predictor of the RFI, for both the internal [HR = 7.44 (95% CI 1.60, 34.6), p = 0.0105] and external [HR = 2.37 (95% CI 1.27, 4.43), p = 0.00685] test sets (Table 3).

Table 3 Multivariable Cox proportional hazards analysis of the risk of recurrence.

No other clinicopathologic variable was statistically significant on the internal test set. Microvascular invasion [HR = 2.84 (95% CI 1.61, 5.00), p = 0.000294] and fibrosis stage [HR = 0.501 (95% CI 0.278, 0.904), p = 0.0217] showed statistical significance on the external test set, along with HCC-SurvNet’s risk score. Schoenfeld’s global test showed p values greater than 0.05 on both the internal (p = 0.083) and external (p = 0.0702) test sets. On mixed-effect Cox regression analysis with the TCGA institution as a random effect, HCC-SurvNet’s risk score was an independent predictor (p = 0.014), along with the histologic grade (p = 0.014) and macrovascular invasion (p = 0.013). In the external test (Stanford-HCC) cohort, HCC-SurvNet’s risk score was positively associated with the AJCC stage grouping, greatest tumor diameter, and microvascular invasion, and negatively associated with fibrosis stage (Table 4). HCC-SurvNet’s risk score yielded a significantly higher Harrell’s c-index (0.72 for the internal and 0.68 for the external test cohort) than the AJCC stage grouping (0.56 for the internal and 0.60 for the external test cohort) (p = 0.018 and 0.025, respectively).

Table 4 Association between the HCC-SurvNet risk score and various patient characteristics in the external test (Stanford-HCC) cohort.

Deep learning predicts postsurgical recurrence of hepatocellular carcinoma from digital histopathologic images | Scientific Repo
