Methods for Transfer-learning Based Integrated Cox Models

The survkl package implements a transfer-learning procedure that integrates external summary information with newly collected time-to-event data under a Cox proportional hazards model. This vignette summarizes the underlying methodology: the internal Cox model, the external summary information, the partial likelihood-based Kullback–Leibler (KL) transfer-learning objective, and the regularized extension for high-dimensional data.

Cox Proportional Hazards Model for the Target Cohort

Let \(D_i\) denote the death time and \(C_i\) the censoring time for patient \(i\), \(i = 1, \ldots, n\), where \(n\) is the total sample size of the target (internal) cohort. The observed survival time is \(T_i = \min\{D_i, C_i\}\), and the death indicator is \(\delta_i = \mathbb{I}(D_i \le C_i)\). Let \(Z_i = (Z_{i1}, \ldots, Z_{ip})^\top\) be a \(p\)-dimensional covariate vector for the \(i\)-th patient. We assume that, conditional on \(Z_i\), \(D_i\) is independently censored by \(C_i\). Consider the Cox proportional hazards model

\[ \lambda(t \mid Z_i) = \lambda_0(t)\,\exp\{g(Z_i, \beta)\}, \]

where \(\lambda_0(t)\) is an unspecified baseline hazard function, \(g(Z_i, \beta)\) specifies the log-relative-risk relationship between the covariates \(Z_i\) and the hazard function, and \(\beta \in \mathbb{R}^p\) is a vector of regression parameters. Under the standard linear specification, \(g(Z_i, \beta) = Z_i^\top \beta\). The log-partial likelihood is given by

\[ \ell(\beta) = \sum_{i=1}^{n} \delta_i \left[ g(Z_i, \beta) - \log\left\{ \sum_{l=1}^{n} Y_l(T_i)\,\exp\{g(Z_l, \beta)\} \right\} \right], \]

where \(Y_l(T_i) = \mathbb{I}(T_l \ge T_i)\) is the at-risk indicator.
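As a minimal sketch of the log-partial likelihood above under the linear specification \(g(Z_i, \beta) = Z_i^\top \beta\), the following Python snippet evaluates \(\ell(\beta)\) directly from its definition. The function name, the toy data, and the use of Python are illustrative assumptions; this is not the survkl implementation, and no special handling of tied event times is attempted.

```python
import math

def log_partial_likelihood(times, events, Z, beta):
    """Cox log-partial likelihood with g(Z_i, beta) = Z_i' beta (illustrative)."""
    # Linear predictor g(Z_i, beta) for every subject.
    lp = [sum(z_j * b_j for z_j, b_j in zip(z, beta)) for z in Z]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue  # censored subjects enter only through the risk sets
        # Sum over the risk set at T_i, i.e. subjects with Y_l(T_i) = 1.
        risk_sum = sum(math.exp(lp[l]) for l, t_l in enumerate(times) if t_l >= t_i)
        total += lp[i] - math.log(risk_sum)
    return total

# Made-up toy data: 4 subjects, 1 covariate, subject 2 censored.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
Z = [[1.0], [0.0], [1.0], [0.0]]

ll_at_zero = log_partial_likelihood(times, events, Z, [0.0])
ll_at_half = log_partial_likelihood(times, events, Z, [0.5])
```

At \(\beta = 0\) every subject has the same relative risk, so each event contributes minus the log of its risk-set size; for the toy data the risk sets have sizes 4, 2, and 1.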

External Summary Information

To account for privacy constraints, we consider scenarios where only external summary information is available, rather than individual-level external data. For example, suppose the estimated coefficients \(\tilde{\beta}\) are available from a published Cox model; a risk score can then be computed as \(\tilde{g}(Z_i) = Z_i^\top \tilde{\beta}\) for the \(i\)-th subject in the target cohort. The proposed transfer-learning procedure is flexible and can incorporate various forms of external summary information, including estimated risk scores from machine-learning algorithms and clinically derived risk groupings.
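For the published-coefficient case, computing the external risk score amounts to one inner product per subject. The sketch below assumes the external coefficients are aligned with the columns of the internal covariate matrix; the function name and the numbers are hypothetical and serve only as an illustration.

```python
def external_risk_scores(Z, beta_tilde):
    """Return g_tilde(Z_i) = Z_i' beta_tilde for each row Z_i (illustrative)."""
    scores = []
    for z in Z:
        if len(z) != len(beta_tilde):
            # The external coefficients must match the internal covariates.
            raise ValueError("covariate vector and beta_tilde differ in length")
        scores.append(sum(z_j * b_j for z_j, b_j in zip(z, beta_tilde)))
    return scores

# Made-up example: two subjects, two covariates.
Z = [[1.0, 2.0], [0.0, 1.0]]
beta_tilde = [0.5, -1.0]
g_tilde = external_risk_scores(Z, beta_tilde)
```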

Partial Likelihood-Based Transfer Learning

To extract information from external risk scores, we formulate the censored time-to-event data as a dynamic ranking problem. Specifically, suppose the internal cohort comprises \(K\) unique failure times \(t_1 < \cdots < t_K\) and, for notational convenience, index by \(k\) the individual who fails at \(t_k\). Let \(A_k\) denote the event that individual \(k\) fails in \([t_k, t_k + dt_k)\), and let \(B_k\) denote all the censoring and failure information up to time \(t_k^{-}\), together with the information that one failure occurs in \([t_k, t_k + dt_k)\). Based on the external risk scores, the conditional density of \(A_k\) given \(B_k\) is

\[ \tilde{f}(A_k \mid B_k) = \frac{\tilde{\lambda}_0(t_k)\,\exp\{\tilde{g}(Z_k)\}\,dt_k} {\sum_{i=1}^{n} Y_i(t_k)\,\tilde{\lambda}_0(t_k)\,\exp\{\tilde{g}(Z_i)\}\,dt_k} = \frac{\exp\{\tilde{g}(Z_k)\}} {\sum_{i=1}^{n} Y_i(t_k)\,\exp\{\tilde{g}(Z_i)\}}, \]

where the second equality follows from canceling \(\tilde{\lambda}_0(t_k)\,dt_k\) in the numerator and denominator. Following Wang et al. (2023), the partial likelihood-based KL divergence between the conditional densities corresponding to the external risk scores and the internal Cox model, contained in \(A_k \mid B_k\), is given by

\[ d_{\mathrm{KL}}(\tilde{f} \parallel f;\, t_k) = \mathbb{E}_{\tilde{f}} \left[ \log\left\{ \frac{\tilde{f}(A_k \mid B_k)}{f(A_k \mid B_k)} \right\} \right], \]

where the expectation is taken with respect to the external conditional density \(\tilde{f}(A_k \mid B_k)\), and \(f(A_k \mid B_k)\) is the conditional density based on the internal Cox model,

\[ f(A_k \mid B_k) = \frac{\exp\{g(Z_k, \beta)\}} {\sum_{i=1}^{n} Y_i(t_k)\,\exp\{g(Z_i, \beta)\}}. \]

When \(\tilde{g}(Z_k)\) is generated from clinically derived risk groupings, \(\tilde{f}(A_k \mid B_k)\) does not represent a formal conditional density; instead, it can be viewed as a Plackett–Luce ranking metric, and \(d_{\mathrm{KL}}(\tilde{f} \parallel f;\, t_k)\) can be interpreted as a generalized KL divergence. The accumulated KL divergence across the sequence of conditional experiments \(A_1 \mid B_1, \ldots, A_K \mid B_K\) is

\[ D_{\mathrm{KL}}(\tilde{f} \parallel f) = \sum_{k=1}^{K} d_{\mathrm{KL}}(\tilde{f} \parallel f;\, t_k), \]

which measures the discrepancy between the external risk scores and the internal Cox model. To integrate external information while accounting for potential disparities, we combine the internal log-partial likelihood with the accumulated KL divergence by constructing the penalized objective function

\[ \ell_{\eta}(\beta) = \ell(\beta) - \eta\, D_{\mathrm{KL}}(\tilde{f} \parallel f), \]

where \(\eta \ge 0\) is a tuning parameter that controls the trade-off between the internal model and the external risk scores. Setting \(\eta = 0\) recovers the internal-only Cox fit, whereas larger values of \(\eta\) place more weight on the external information.
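The quantities in this section can be sketched numerically. The snippet below is an illustration under stated assumptions, not the package's code: it treats \(\tilde{f}\) and \(f\) at each failure time as probability vectors over the risk set, takes the expectation in \(d_{\mathrm{KL}}\) as a sum over subjects at risk, and works with precomputed internal scores \(g(Z_i, \beta)\). All function names and data are hypothetical.

```python
import math

def risk_set_probs(times, scores, t):
    """Conditional failure probabilities exp(score_i) / sum over the risk set at t."""
    w = [math.exp(s) if u >= t else 0.0 for u, s in zip(times, scores)]
    total = sum(w)
    return [w_i / total for w_i in w]

def accumulated_kl(times, events, ext_scores, int_scores):
    """D_KL(f_tilde || f): per-failure-time KL divergences summed over t_1..t_K."""
    failure_times = sorted({t for t, d in zip(times, events) if d == 1})
    d_total = 0.0
    for t_k in failure_times:
        p_ext = risk_set_probs(times, ext_scores, t_k)
        p_int = risk_set_probs(times, int_scores, t_k)
        # Subjects outside the risk set have probability zero and are skipped.
        d_total += sum(pe * math.log(pe / pi)
                       for pe, pi in zip(p_ext, p_int) if pe > 0.0)
    return d_total

def log_partial_likelihood(times, events, int_scores):
    """Internal log-partial likelihood evaluated at scores g(Z_i, beta)."""
    total = 0.0
    for t_i, d_i, g_i in zip(times, events, int_scores):
        if d_i == 1:
            risk_sum = sum(math.exp(g) for u, g in zip(times, int_scores) if u >= t_i)
            total += g_i - math.log(risk_sum)
    return total

def integrated_objective(times, events, ext_scores, int_scores, eta):
    """l_eta(beta) = l(beta) - eta * D_KL(f_tilde || f)."""
    return (log_partial_likelihood(times, events, int_scores)
            - eta * accumulated_kl(times, events, ext_scores, int_scores))

# Made-up toy data: external and internal scores for 4 subjects.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
ext_scores = [0.5, -0.2, 0.1, 0.0]
int_scores = [0.0, 0.3, -0.4, 0.2]

kl_same = accumulated_kl(times, events, ext_scores, ext_scores)
kl_diff = accumulated_kl(times, events, ext_scores, int_scores)
obj_small = integrated_objective(times, events, ext_scores, int_scores, 0.5)
obj_large = integrated_objective(times, events, ext_scores, int_scores, 2.0)
```

When the internal scores equal the external ones, the divergence vanishes and the penalty has no effect; otherwise the penalty grows linearly in \(\eta\).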

Equivalent weighted form. Substituting the Cox-model expressions, and assuming no tied event times so that the unique failure times \(t_1 < \cdots < t_K\) coincide with the observed internal event times, the integrated objective admits, up to the positive factor \(1 + \eta\) and an additive term that does not depend on \(\beta\), the equivalent weighted partial-likelihood form

\[ \ell_{\eta}(\beta) \;\propto\; \sum_{i=1}^{n} \left\{ \frac{\delta_i + \eta\, \tilde{\delta}_i}{1 + \eta}\, g(Z_i, \beta) - \delta_i \log\left[ \sum_{l=1}^{n} Y_l(T_i)\,\exp\{g(Z_l, \beta)\} \right] \right\}, \]

where the externally induced pseudo-event weight is defined as

\[ \tilde{\delta}_i = \sum_{k=1}^{K} \frac{Y_i(t_k)\,\exp\{\tilde{g}(Z_i)\}} {\sum_{j=1}^{n} Y_j(t_k)\,\exp\{\tilde{g}(Z_j)\}}. \]

This representation shows that the external information enters the internal partial likelihood by augmenting each subject’s observed event indicator \(\delta_i\) with a fractional pseudo-event weight \(\tilde{\delta}_i\) derived from the external risk scores, with \(\eta\) governing the relative contribution of the two sources.
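The equivalence rests on the fact that, at each failure time, the external probabilities sum to one over the risk set, so the log-risk-set terms of \(\ell(\beta)\) and of \(D_{\mathrm{KL}}\) combine with total weight \(1 + \eta\). The sketch below checks this numerically on made-up data with no tied event times: it computes \(\tilde{\delta}_i\), evaluates both the weighted form and the direct form \(\ell(\beta) - \eta D_{\mathrm{KL}}\), and compares their differences between two sets of internal scores. Function names and data are hypothetical, and the code is illustrative rather than the package's implementation.

```python
import math

def failure_times(times, events):
    return sorted({t for t, d in zip(times, events) if d == 1})

def risk_probs(times, scores, t):
    """exp(score_i) normalised over the risk set at t (zero outside it)."""
    w = [math.exp(s) if u >= t else 0.0 for u, s in zip(times, scores)]
    total = sum(w)
    return [x / total for x in w]

def log_risk_sum(times, scores, t):
    return math.log(sum(math.exp(s) for u, s in zip(times, scores) if u >= t))

def pseudo_event_weights(times, events, ext_scores):
    """delta_tilde_i: external failure probabilities summed over t_1..t_K."""
    delta_tilde = [0.0] * len(times)
    for t_k in failure_times(times, events):
        for i, p in enumerate(risk_probs(times, ext_scores, t_k)):
            delta_tilde[i] += p
    return delta_tilde

def weighted_objective(times, events, ext_scores, int_scores, eta):
    """Weighted partial-likelihood form of the integrated objective."""
    delta_tilde = pseudo_event_weights(times, events, ext_scores)
    total = 0.0
    for t_i, d_i, dt_i, g_i in zip(times, events, delta_tilde, int_scores):
        total += (d_i + eta * dt_i) / (1.0 + eta) * g_i
        if d_i == 1:
            total -= log_risk_sum(times, int_scores, t_i)
    return total

def direct_objective(times, events, ext_scores, int_scores, eta):
    """l(beta) - eta * D_KL, computed from the definitions."""
    loglik = sum(g - log_risk_sum(times, int_scores, t)
                 for t, d, g in zip(times, events, int_scores) if d == 1)
    kl = 0.0
    for t_k in failure_times(times, events):
        p_ext = risk_probs(times, ext_scores, t_k)
        p_int = risk_probs(times, int_scores, t_k)
        kl += sum(pe * math.log(pe / pi) for pe, pi in zip(p_ext, p_int) if pe > 0.0)
    return loglik - eta * kl

# Made-up toy data without ties; g_a and g_b stand for g(Z_i, beta) at two betas.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
ext_scores = [0.5, -0.2, 0.1, 0.0]
g_a = [0.0, 0.3, -0.4, 0.2]
g_b = [0.6, -0.1, 0.2, -0.5]
eta = 2.0

delta_tilde = pseudo_event_weights(times, events, ext_scores)
lhs = (direct_objective(times, events, ext_scores, g_a, eta)
       - direct_objective(times, events, ext_scores, g_b, eta))
rhs = (1.0 + eta) * (weighted_objective(times, events, ext_scores, g_a, eta)
                     - weighted_objective(times, events, ext_scores, g_b, eta))
```

Because the two forms differ only by the factor \(1 + \eta\) and a term free of \(\beta\), `lhs` and `rhs` agree up to floating-point error, and the pseudo-event weights sum to the number of unique failure times.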

Regularization for High-Dimensional Data

For high-dimensional applications, where the number of covariates \(p\) may be large relative to the sample size \(n\), we extend the integrated objective by adding a regularization term. The resulting objective function enables simultaneous variable selection and parameter estimation:

\[ \ell_{\eta, \lambda}(\beta) = \ell_{\eta}(\beta) - \lambda\, P(\beta), \]

where \(P(\beta)\) is a penalty function and \(\lambda \ge 0\) is a tuning parameter controlling its strength. The package supports the following choices of \(P(\beta)\):

  • Ridge (Hoerl and Kennard, 1970): \[ P(\beta) = \tfrac{1}{2}\,\|\beta\|_2^2 = \tfrac{1}{2}\sum_{j=1}^{p} \beta_j^2, \] which shrinks coefficients toward zero and stabilizes estimation under collinearity.

  • LASSO (Tibshirani, 1997): \[ P(\beta) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|, \] which produces sparse solutions by setting some coefficients exactly to zero.

  • Elastic Net (Simon et al., 2011): \[ P(\beta) = \alpha\,\|\beta\|_1 + \tfrac{1}{2}(1 - \alpha)\,\|\beta\|_2^2 = \sum_{j=1}^{p}\left[ \alpha\,|\beta_j| + \tfrac{1}{2}(1 - \alpha)\,\beta_j^2 \right], \] where \(\alpha \in [0, 1]\) is a mixing parameter that blends the LASSO and ridge penalties; \(\alpha = 1\) reduces to the LASSO and \(\alpha = 0\) to ridge.
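The three penalties listed above can be written as short functions. This is a plain transcription of the formulas for illustration; the function names are hypothetical and do not correspond to the survkl interface.

```python
def ridge_penalty(beta):
    """P(beta) = 0.5 * ||beta||_2^2."""
    return 0.5 * sum(b * b for b in beta)

def lasso_penalty(beta):
    """P(beta) = ||beta||_1."""
    return sum(abs(b) for b in beta)

def elastic_net_penalty(beta, alpha):
    """P(beta) = alpha * ||beta||_1 + 0.5 * (1 - alpha) * ||beta||_2^2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # ridge_penalty already carries the factor 1/2.
    return alpha * lasso_penalty(beta) + (1.0 - alpha) * ridge_penalty(beta)

# Made-up coefficient vector.
beta = [1.0, -2.0, 0.0]
p_ridge = ridge_penalty(beta)
p_lasso = lasso_penalty(beta)
p_enet = elastic_net_penalty(beta, 0.5)
```

Setting `alpha` to 1 or 0 in `elastic_net_penalty` reproduces the LASSO and ridge penalties, respectively, matching the special cases stated in the text.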

In survkl, ridge-penalized estimation is provided by coxkl_ridge, while the elastic-net family (including the LASSO as the special case \(\alpha = 1\)) is provided by coxkl_enet. The companion cross-validation routines cv.coxkl, cv.coxkl_ridge, and cv.coxkl_enet perform \(K\)-fold cross-validation (here \(K\) denotes the number of folds, not the number of unique failure times) to select the integration weight \(\eta\) and, for the penalized routines, the regularization parameter \(\lambda\), using Harrell’s C-index for discrimination and the Verweij and van Houwelingen (V&VH) loss for overall model fit.
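For readers unfamiliar with the discrimination criterion, the following is a minimal sketch of Harrell's C-index under one common convention: a pair is comparable when the subject with the shorter observed time had an event, and tied risk scores count one half. How survkl treats ties or aggregates the index across folds is not specified here, so the function below is an assumption-laden illustration with a hypothetical name.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs ordered correctly by the risk scores."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the earlier subject in a comparable pair must have an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk failed earlier
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Made-up toy data: scores that rank subjects perfectly, then in reverse.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
c_perfect = harrell_c_index(times, events, [3.0, 2.0, 1.0, 0.0])
c_reversed = harrell_c_index(times, events, [0.0, 1.0, 2.0, 3.0])
c_constant = harrell_c_index(times, events, [1.0, 1.0, 1.0, 1.0])
```

In a cross-validation loop, a criterion of this kind would be evaluated on each held-out fold for every candidate value of \(\eta\) (and \(\lambda\), where applicable).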