# An explainable machine learning-based phenomapping approach for adaptive predictive enrichment in randomized clinical trials


### Data source and patient populations

The Insulin Resistance Intervention after Stroke (IRIS) trial recruited individuals at least 40 years of age with a recent ischemic stroke or transient ischemic attack (TIA) during the 6 months before randomization, who did not have diabetes mellitus at the time of enrollment but had evidence of insulin resistance based on a homeostasis model assessment of insulin resistance (HOMA-IR) index score of 3.0 or greater. Participants were randomly assigned in a 1:1 ratio to receive either pioglitazone or matching placebo (with dose up-titration as specified in the original trial report)^{18}. Patients were contacted every 4 months, and participation ended at 5 years or at the last scheduled contact before July 2015. The Systolic Blood Pressure Intervention Trial (SPRINT) enrolled 9361 participants, 50 years of age or older, with a systolic blood pressure (SBP) of 130–180 mm Hg with or without antihypertensive drug treatment, as well as an additional indicator of cardiovascular risk. These included clinical or subclinical cardiovascular disease, chronic kidney disease, a 10-year Framingham Risk Score for cardiovascular disease of 15% or greater, or age of 75 years or older. Patients with diabetes mellitus, prior stroke, or dementia were excluded from this trial. Participants were enrolled between 2010 and 2013 at 102 clinical sites in the U.S.^{19}

### Study outcomes

In accordance with the primary outcome of the original trials, we focused on a composite of first fatal or non-fatal stroke or fatal or non-fatal myocardial infarction as the primary outcome for IRIS, and a composite of myocardial infarction, acute coronary syndrome not resulting in myocardial infarction, stroke, acute decompensated heart failure, or death from cardiovascular causes for SPRINT. We also explored a hierarchical composite outcome, defined as follows. For IRIS: all-cause mortality, followed by non-fatal MACE (major adverse cardiovascular events) components, hospitalization events, heart failure events, and bone fractures. For SPRINT: all-cause mortality, followed by non-fatal MACE, and finally serious adverse events. Definitions were concordant with those used in the original trial reports^{18,19}. All outcomes and selected safety events were adjudicated by the members of independent committees in a blinded fashion for each of the trials.

### Design of a group sequential, adaptive trial simulation

We designed a simulation algorithm to test the hypothesis that interim ML-guided analyses of computational trial phenomaps can adaptively guide the trials’ enrollment strategy and improve their efficiency while reducing their final/required size. The premise of this approach is that ML phenomapping-derived insights can steer recruitment toward patients who are more likely to benefit from the intervention. For this, we defined three interim analysis timepoints, with the final analysis occurring once all primary events had been reported. It should be noted that the original power calculation for IRIS had assumed higher event rates and faster enrollment than were observed during the trial, thus prompting serial amendments to the trial protocol, including an extension of recruitment and an increase in the study size (from 3136 patients initially to 3936 patients). In a post-hoc fashion, knowing that the primary outcome occurred in 228 of 1937 participants in the placebo arm (~11.8% rate) and 175 of 1939 participants in the pioglitazone arm (~9.0%), we simulated power calculations assuming a superiority trial design with a one-sided α of 0.025 (see “Power calculations” section below). We defined the timepoints at which 50, 100, and 150 total primary outcome events had been recorded in the original trial as our first, second, and third interim analysis timepoints, respectively. In SPRINT, we assumed that the respective primary outcome would occur in 6.8% vs 5.4% of the standard and intensive arms and, for consistency, defined the interim analysis timepoints based on the occurrence of the first 50, 100, and 150 primary outcome events.
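The event-count-based interim schedule described above can be sketched in a few lines. `interim_timepoints` is a hypothetical helper that, given the follow-up times of the observed primary outcome events, returns the calendar times at which the 50th, 100th, and 150th events accrue:

```python
from typing import List, Sequence, Tuple

def interim_timepoints(event_times: Sequence[float],
                       thresholds: Tuple[int, ...] = (50, 100, 150)) -> List[float]:
    """Return the follow-up times at which the cumulative number of primary
    outcome events first reaches each threshold (hypothetical helper)."""
    ordered = sorted(event_times)
    return [ordered[k - 1] for k in thresholds if k <= len(ordered)]

# Toy data: one event every 10 "days" of follow-up.
toy_events = [10.0 * i for i in range(1, 201)]
print(interim_timepoints(toy_events))  # → [500.0, 1000.0, 1500.0]
```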

### Overview of the predictive enrichment strategy

During the first enrollment period of the simulation (study onset until first interim analysis), we enrolled all trial participants, identical to the original trials, without any restrictions or modifications in the enrollment process. Starting at the first interim analysis timepoint, participants recruited up until that stage were randomly split into training/cross-validation (50%) and testing (50%) sets. In the training set, baseline data were pre-processed and used to define a phenomap (see “Computational trial phenomaps” section below), which represented the phenotypic architecture of the population across all axes of baseline variance. Through iterative analyses centered around each unique individual and weighted for each unique participant’s location in the phenotypic space, we defined individualized estimates of the effects of the studied intervention, as compared with the control arm, for the primary outcome.

Subsequently, we built an ML framework to identify key features that collectively determined a phenotypic signature (algorithm) predictive of these individualized estimates. The algorithm was then applied in the testing set, evaluating for evidence of possible heterogeneous treatment effects by dichotomizing the population into two groups based on their predicted response. To avoid imbalanced groups or extreme outliers of responders or non-responders, the smallest subgroup size was set at 20%. We then tested for the presence of heterogeneity in the observed effect estimates between the two subgroups in the testing set by calculating the *p* value for interaction of treatment effect. Given that testing was performed using just half of the observations collected at each interim analysis timepoint, we defined a less conservative threshold of *p*_{interaction} < 0.2 as our criterion to screen for presence of heterogeneity.
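As a rough sketch of the dichotomization step, the snippet below splits a testing-set population on predicted log HR while enforcing the 20% minimum-subgroup rule. The use of a median cut-point (clipped to the 20th–80th percentiles) is an illustrative assumption, since the exact cut-point is not specified here:

```python
import numpy as np

def split_by_predicted_response(pred_log_hr: np.ndarray, min_frac: float = 0.20):
    """Dichotomize a population into predicted responders vs non-responders.

    Cut at the median predicted log HR (an illustrative choice), clipped to
    the [min_frac, 1 - min_frac] quantiles so that neither subgroup falls
    below 20% of the population."""
    lo, hi = np.quantile(pred_log_hr, [min_frac, 1.0 - min_frac])
    cut = float(np.clip(np.median(pred_log_hr), lo, hi))
    responders = pred_log_hr <= cut  # more negative log HR = greater predicted benefit
    return responders, cut

rng = np.random.default_rng(0)
preds = rng.normal(loc=-0.1, scale=0.3, size=1000)
responders, cut = split_by_predicted_response(preds)
print(f"responder fraction: {responders.mean():.2f}")
```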

If there was potential evidence of heterogeneity based on this analysis, sample size calculations were updated at that interim analysis timepoint by revising the expected effect size (under the assumption of predictive enrichment) at the original power and alpha levels (0.8 and 0.025, respectively, in both trials). This was done to assess whether prospective predictive enrichment and the associated decrease in the projected number of recruited individuals would provide sufficient power at a sample size equal to or smaller than the originally planned one. We performed sample size calculations assuming various levels of predictive enrichment, which ranged from enrolling 50% to 95% of all remaining candidates, in 5% increments. If there were several levels that met these criteria, we ultimately chose the predictive enrichment level that minimized the required sample.
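A minimal sketch of the sample size update, assuming a standard normal-approximation formula for comparing two proportions at one-sided α = 0.025 and 80% power. The enrichment-dependent event rates in the loop are purely illustrative, not the trials' actual projections:

```python
from statistics import NormalDist

def two_proportion_n(p1: float, p2: float,
                     alpha: float = 0.025, power: float = 0.80) -> int:
    """Per-arm sample size for a superiority comparison of two event rates
    (classical normal-approximation formula, one-sided alpha)."""
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    num = (z_a * (2.0 * p_bar * (1.0 - p_bar)) ** 0.5
           + z_b * (p1 * (1.0 - p1) + p2 * (1.0 - p2)) ** 0.5) ** 2
    return int(num / (p1 - p2) ** 2) + 1

# Illustrative enrichment scan: a stronger assumed effect in the enriched
# population lowers the required sample (treated-arm rates below are made up).
for enrolled_frac, treated_rate in [(0.95, 0.093), (0.75, 0.088), (0.50, 0.080)]:
    print(enrolled_frac, two_proportion_n(0.118, treated_rate))
```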

Assuming the above, over the subsequent period, the probability of enrollment was conditioned on the anticipated benefit, assessed by applying the most recent model to each potential candidate’s baseline characteristics. Alternatively, if there was no evidence of heterogeneous treatment effect, or the proposed enrichment in enrollment would not be adequately powered at the revised sample size, we proceeded as originally planned and continued with standard enrollment for that time-period without predictive enrichment. This process was repeated at each interim analysis timepoint. There was no assessment for futility.

Given the stochastic nature of the algorithm, the simulation was repeated *r* = 10 times. For reference, we present the observed outcomes of the full trial population at the same timepoints. To enable direct comparison between the different simulations, the final analysis was performed at the timepoint at which all primary outcome events had occurred in the original trial population.

### Data-preprocessing

Our analysis included 62 phenotypic features recorded at baseline in IRIS (Supplementary Table 2), and 82 baseline features in SPRINT (Supplementary Table 3), as per our prior work^{11}. At every point, pre-processing steps, including imputation, were performed independently for each patient subset to avoid data leakage. Baseline features with greater than 10% missingness were removed from further analysis. To avoid collinearity of continuous variables, we calculated pairwise correlations across variables, and wherever pairs exceeded an absolute correlation coefficient of 0.9, we excluded the variable with the largest mean absolute correlation across all pairwise comparisons. Continuous variables also underwent 95% winsorization to reduce the effects of extreme outliers, whereas factor variables with zero variance were dropped from further processing. Next, we imputed missing data using a version of the random forest imputation algorithm adapted for mixed datasets with a maximum of five iterations (function MissForest, from Python package missingpy v0.2.0). Factor variables underwent one-hot encoding for ease of processing with downstream visualization and machine learning algorithms.
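The correlation-pruning rule can be sketched as follows, as a toy implementation assuming a pandas DataFrame of continuous variables (column names are hypothetical):

```python
import numpy as np
import pandas as pd

def prune_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one member of each highly correlated pair of continuous variables,
    removing the one with the largest mean absolute pairwise correlation."""
    df = df.copy()
    while True:
        corr = df.corr().abs()
        np.fill_diagonal(corr.values, 0.0)
        if corr.values.max() <= threshold:
            return df
        i, j = np.unravel_index(np.argmax(corr.values), corr.shape)
        pair = [corr.index[i], corr.columns[j]]
        # Remove whichever member is, on average, more correlated with the rest.
        drop = max(pair, key=lambda c: corr[c].mean())
        df = df.drop(columns=[drop])

rng = np.random.default_rng(1)
x = rng.normal(size=500)
toy = pd.DataFrame({"a": x,
                    "b": x + rng.normal(scale=0.01, size=500),  # near-duplicate of "a"
                    "c": rng.normal(size=500)})
print(sorted(prune_correlated(toy).columns))
```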

### Computational trial phenomaps

Once the dataset for a given simulation at a specified time-point was created, we computed a dissimilarity index that classified individuals based on their detailed clinical characteristics according to Gower’s distance. Gower’s method computes a distance value for each pair of individuals. For continuous variables, Gower’s distance represents the absolute value of the difference between a pair of individuals divided by the range across all individuals. For categorical variables, the method assigns “0” if the values are identical and “1” if they are not. Gower’s distance is ultimately calculated as the mean of these terms^{43}. At this point, the phenotypic architecture of the trial can be visualized using uniform manifold approximation and projection (UMAP)^{44}, a method that constructs a high-dimensional graph and then optimizes a low-dimensional graph to be as structurally similar as possible. UMAP aims to maintain a balance between the local and global structure of the data by decreasing the likelihood of connection as the outwards radius around each data point increases, thus maintaining the local architecture while ensuring that each point is connected to at least its closest neighbor and ensuring a global representation^{44}.
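A minimal sketch of the pairwise Gower computation for a mixed dataset, following the definitions above (continuous terms scaled by the variable's range; categorical terms 0 for a match and 1 for a mismatch):

```python
import numpy as np

def gower_distance(X_num: np.ndarray, X_cat: np.ndarray) -> np.ndarray:
    """Pairwise Gower distance for a mixed dataset (minimal sketch).

    Continuous variables contribute |x_i - x_j| / range; categorical
    variables contribute 0 for a match and 1 for a mismatch; the distance
    is the mean of all per-variable terms."""
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0  # guard against constant columns
    terms = [np.abs(X_num[:, v, None] - X_num[None, :, v]) / ranges[v]
             for v in range(X_num.shape[1])]
    terms += [(X_cat[:, v, None] != X_cat[None, :, v]).astype(float)
              for v in range(X_cat.shape[1])]
    return np.mean(terms, axis=0)

num = np.array([[0.0], [5.0], [10.0]])   # one continuous variable
cat = np.array([["A"], ["A"], ["B"]])    # one categorical variable
D = gower_distance(num, cat)
print(D[0, 1], D[0, 2])  # → 0.25 1.0
```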

### Defining individualized hazard estimates

To extract personalized estimates of predicted benefit with pioglitazone or intensive SBP control, versus placebo or standard SBP reduction, respectively, for each individual included in each interim analysis, we applied weighted estimation in Cox regression models^{45}. We used all pertinent outcome events for the selected population (in the training set) with censoring at the time of the specified interim analysis. With every iteration of this regression around each unique individual, every study participant was assigned unique weights based on the phenotypic (Gower’s) distance from the index patient of that analysis. To ensure that patients phenotypically closer to the index patient carried higher cumulative weights than patients located further away, we applied a cubic exponential transformation of the similarity metric, defined as (*1 − Gower’s distance*). These values were further processed through a Rectified Linear Unit (ReLU) function prior to their inclusion as weights in the regression models. This allowed us to simultaneously model an exponential decay function and control the impact of low values. From each personalized Cox regression model (fitted for each unique participant with individualized weightings as above), we extracted the natural logarithmic transformation of the hazard ratio (log HR) for the primary outcome for the intervention versus control.
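One plausible reading of this weighting scheme is sketched below: the similarity (1 − Gower's distance) is cubed and then passed through a shifted ReLU. The shift value (`floor`) is a hypothetical choice, since the text does not specify how low values were zeroed out:

```python
import numpy as np

def phenotypic_weights(gower_row: np.ndarray, floor: float = 0.05) -> np.ndarray:
    """Turn one row of the Gower distance matrix (distances from the index
    patient) into regression weights: cube the similarity (1 - distance) so
    that weight decays steeply with distance, then apply a shifted ReLU so
    that very dissimilar patients receive zero weight.

    `floor` is a hypothetical shift; the text does not state the exact cutoff."""
    similarity = 1.0 - gower_row
    return np.maximum(similarity ** 3 - floor, 0.0)

w = phenotypic_weights(np.array([0.0, 0.2, 0.5, 0.95]))
print(w)  # weights decrease with distance; the most distant patient gets 0
```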

### Machine learning estimation of individualized benefit

To identify baseline features that are important in determining the personalized benefit of the studied intervention relative to control (described by the individualized log HR), an extreme gradient boosting algorithm (known as XGBoost based on a tree gradient booster) was fitted with simultaneous feature selection based on the Boruta and SHAP (SHapley Additive exPlanations) methods. Briefly, the Boruta method creates randomized (permuted) versions of each feature (called “shadow features”) before merging the permuted and original data. Once a model is trained, the importance of all original features is compared to the highest feature importance of the shadow features. This process was repeated for *n* = 20 iterations, without sampling. SHAP was added as an approach to explain the output of the ML model, based on concepts derived from game theory. SHAP calculates the average marginal contributions for each feature across all permutations at a local level. With the addition of SHAP analysis, the feature selection further benefits from the strong additive feature explanations but maintains the robustness of the Boruta method^{46}. The training data were further split into training and testing sets (with a random 80-20% split). We set our problem as a regression task using root mean squared error as our metric to evaluate our model’s accuracy during testing. Before training, the labels (i.e., previously calculated individualized log HR) underwent 95% winsorization to minimize the effects of extreme outliers. First, we fitted an XGBoost model using the Boruta algorithm to identify a subset of important baseline features, and then repeated this process to predict the individualized log HR, this time using only the selected features as input.
Hyperparameter tuning was achieved through a grid search across 25 iterations (learning rate: [0.01, 0.05, 0.10, 0.15]; maximal depth of the tree: [3, 5, 6, 10, 15, 20]; fraction of training samples used to train each tree: 0.5 to 1.0 by 0.1 increments; number of features supplied to a tree: 0.4 to 1.0 by 0.1 increments; random subsample of columns when every new level is reached: 0.4 to 1.0 by 0.1 increments; number of gradient boosted trees: [100, 500, 1000]). We trained the model for a maximum of 1000 rounds, with an early stopping function every 20 rounds once the loss in the validation set started to increase. The importance of each feature was again visualized using a SHAP plot. SHAP values measure the impact of each variable considering the interaction with other variables. We visualized these using a SHAP summary plot, in which the vertical axis represented the variables in descending order of importance and the horizontal axis indicated the change in prediction. The gradient color denotes the original value for that variable (for instance, for binary variables such as sex, it only takes two colors, whereas for continuous variables, it contains the whole spectrum). In Supplementary Table 6 we present a more extensive discussion of key parameters and the rationale behind their default values.
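The hyperparameter grid can be written out directly; sampling 25 configurations from it mirrors the "25 iterations" of the search. The XGBoost parameter names used here are the conventional ones, and the exact mapping to the ranges above is an assumption:

```python
import itertools
import random

# Grid mirroring the ranges listed above; the XGBoost parameter names are the
# conventional ones and the exact mapping is an assumption.
grid = {
    "learning_rate": [0.01, 0.05, 0.10, 0.15],
    "max_depth": [3, 5, 6, 10, 15, 20],
    "subsample": [round(0.5 + 0.1 * i, 1) for i in range(6)],         # 0.5–1.0
    "colsample_bytree": [round(0.4 + 0.1 * i, 1) for i in range(7)],  # 0.4–1.0
    "colsample_bylevel": [round(0.4 + 0.1 * i, 1) for i in range(7)], # 0.4–1.0
    "n_estimators": [100, 500, 1000],
}

keys = list(grid)
all_combos = list(itertools.product(*(grid[k] for k in keys)))
random.seed(42)  # reproducible draw of 25 candidate configurations
candidates = [dict(zip(keys, combo)) for combo in random.sample(all_combos, 25)]
print(len(all_combos), len(candidates))
```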

### Adaptive trial enrollment

Once a predictive model was generated at a given interim analysis timepoint, the model was prospectively applied to all trial candidates during the following trial period (time between two interim analyses). For example, a model trained at interim analysis timepoint #1 was applied to individuals screened between the interim analysis timepoints #1 and #2. Here, all patients that were included and enrolled in the original trial during this period were considered eligible candidates. This approach yielded individualized predictions of expected cardiovascular benefit with pioglitazone versus placebo (or intensive versus standard SBP reduction), with these predictions used to condition the probability of a given patient being enrolled in the simulation. Given that the predicted individualized log HR could have both negative (favoring pioglitazone or intensive SBP reduction) and positive values (favoring placebo or standard SBP reduction), predictions were multiplied by -1 and normalized to the [0, 1] range. The result (input *x*) was processed through a sigmoid transformation function (Eq. (1)) with a scaling factor of *k* = 10 where *z* = the ratio of the responders to non-responders, followed by squared transformation.

$$\left(\frac{1}{1+e^{-10\left(x-(1-z)\right)}}\right)^2$$

(1)

These numbers were used as sampling weights during the subsequent period to ensure that patients with higher predicted benefit were more likely to be enrolled over the next period. The process was repeated at each interim analysis timepoint.
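Eq. (1) translates directly into code; a short sketch:

```python
import math

def enrollment_weight(x: float, z: float, k: float = 10.0) -> float:
    """Eq. (1): squared sigmoid mapping a normalized predicted benefit
    x in [0, 1] to a sampling weight, with z the responder/non-responder
    ratio and scaling factor k = 10."""
    return (1.0 / (1.0 + math.exp(-k * (x - (1.0 - z))))) ** 2

# With balanced groups (z = 1) the inflection point sits at x = 0:
print(enrollment_weight(0.0, 1.0))  # → 0.25
print(enrollment_weight(1.0, 1.0))  # high predicted benefit → weight near 1
```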

### Power calculations

We simulated a superiority trial design, assuming a power of 80% and type I error of 0.025. We present alpha level adjustments for each time point, adjusted based on both the O’Brien-Fleming and alpha-spending Pocock methods (Supplementary Table 1). We performed our analyses using the *rpact* package in R, based on the expected event rates from our power calculations above and simulating three interim analyses with four total looks. These analyses are restricted to the primary endpoint.
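The boundary calculations were done with *rpact* in R; for illustration only, here is a Python sketch of the classical Lan-DeMets O'Brien-Fleming- and Pocock-type alpha-spending functions at four equally spaced looks (the actual information fractions in the trials were event-driven, not equally spaced):

```python
from math import e, log, sqrt
from statistics import NormalDist

N = NormalDist()

def obf_spending(t: float, alpha: float = 0.025) -> float:
    """Lan-DeMets O'Brien-Fleming-type spending function (one-sided alpha)."""
    return 2.0 * (1.0 - N.cdf(N.inv_cdf(1.0 - alpha / 2.0) / sqrt(t)))

def pocock_spending(t: float, alpha: float = 0.025) -> float:
    """Lan-DeMets Pocock-type spending function (one-sided alpha)."""
    return alpha * log(1.0 + (e - 1.0) * t)

# Cumulative alpha spent at four equally spaced looks (information fractions):
for t in (0.25, 0.5, 0.75, 1.0):
    print(t, round(obf_spending(t), 6), round(pocock_spending(t), 6))
```

Note how the O'Brien-Fleming-type function spends almost no alpha at early looks, while the Pocock-type function spends it more evenly; both exhaust the full 0.025 at the final look.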

### Negative control analysis

To assess the performance of our algorithm in the presence of an identical average treatment effect (ATE) but with absent (or at least randomly distributed) heterogeneous treatment effects, we randomly shuffled the baseline characteristics of each trial. This ensured that any effects of the baseline characteristics on the effectiveness of the intervention would be lost or be due to random variation.
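A sketch of the shuffling step; whether baseline rows are permuted jointly (as here, which preserves covariate correlations) or each column independently is an assumption:

```python
import numpy as np
import pandas as pd

def shuffle_baseline(X: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Negative control (sketch): permute the rows of the baseline covariate
    block relative to treatment assignment and outcomes, so the average
    treatment effect is preserved while any systematic link between baseline
    phenotype and treatment response is removed."""
    rng = np.random.default_rng(seed)
    return X.iloc[rng.permutation(len(X))].reset_index(drop=True)

# Toy baseline table with hypothetical columns:
baseline = pd.DataFrame({"age": range(60, 160), "sbp": range(130, 230)})
shuffled = shuffle_baseline(baseline)
```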

### Stability analysis

To assess the robustness of our estimates, we defined three interim analysis timepoints for each study. At each timepoint, we randomly split the respective study population into a training and testing set (50:50 split) and repeated the process 100 times (using 100 random seeds). Across these 100 iterations, each observation (study participant) was roughly represented in the training set in half of all splits, and in the testing set in the remaining half. This enabled the variation of the demographic and baseline characteristics of each training population. For each iteration, we followed the same phenomapping approach presented in our work, estimated individualized hazard ratios, fit an extreme gradient boosting (XGBoost) regressor, and subsequently used this to infer personalized effect estimates in the testing set. As done in our main analysis, individuals in the testing set were then ranked based on their predicted responses. Using this approach, we defined the following stability metrics:

*Concordance of relative ranks based on XGBoost-defined personalized treatment effects*: a metric reflecting the stability of relative ranks for two random observations/study participants. For a random patient A and patient B, this reflects the probability that patient A will be consistently ranked as a higher responder than patient B (or vice versa) across simulations, when both patients A and B are present in the testing set. A value of 1 (the number of times A > B equals the number of times B > A) signifies that there is no consistent pattern in the relative ranks, whereas higher values suggest greater concordance in the relative ranks. For this, A always refers to the individual who is most often ranked as a higher responder than patient B. This permits the calculation of a concordance metric (C(A,B)) for each pair of individuals (Eq. (2)). The mean concordance across all possible pairs of individuals is then calculated based on Eq. (3), which estimates the average of the upper triangle of the concordance matrix (as further explained in Supplementary Figure 1). Using this concept, a concordance ratio can then be computed based on the number of times there was concordance vs discordance in the relative ranks across all observations. 95% confidence intervals were calculated using the bootstrapping method with *n* = 1000 replications.

$$C(A,B)=\left\{\begin{array}{ll}1, & \text{if}\; C_{\ge }=C_{\ge }^{\prime}\\ \frac{C_{\ge }}{C_{\ge }^{\prime}}, & \text{if}\; C_{\ge } > C_{\ge }^{\prime}\\ \frac{C_{\ge }^{\prime}}{C_{\ge }}, & \text{otherwise}\end{array}\right.$$

(2)

$$\text{average concordance odds}=\frac{\sum\nolimits_{i=1}^{n-1}\sum\nolimits_{j=i+1}^{n}M_{ij}}{\frac{n(n-1)}{2}}$$

(3)
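Eqs. (2) and (3) can be sketched as follows, where `c_ab`/`c_ba` count how often A or B, respectively, is ranked the higher responder across splits, and `M` is the pairwise concordance matrix:

```python
import numpy as np

def concordance(c_ab: int, c_ba: int) -> float:
    """Eq. (2): pairwise concordance C(A,B), where c_ab counts the splits in
    which A is ranked the higher responder and c_ba those in which B is."""
    if c_ab == c_ba:
        return 1.0
    hi, lo = max(c_ab, c_ba), min(c_ab, c_ba)
    return hi / lo if lo > 0 else float("inf")

def average_concordance_odds(M: np.ndarray) -> float:
    """Eq. (3): mean of the upper triangle of the pairwise concordance matrix."""
    n = M.shape[0]
    upper = M[np.triu_indices(n, k=1)]
    return float(upper.sum() / (n * (n - 1) / 2))

# Toy 3-patient concordance matrix (upper triangle holds the C(A,B) values):
M = np.array([[0.0, 1.5, 1.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
print(concordance(30, 20), average_concordance_odds(M))  # → 1.5 1.5
```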

*Consistency of Boruta SHAP-defined features*: Along with the concordance of relative prediction ranks, we also record the features selected by the Boruta SHAP algorithm at each interim analysis timepoint, and then present the relative frequency with which a given feature appears across iterations and timepoints. This approach assesses the overall consistency in the features identified as potential predictors of heterogeneous treatment effects across different time intervals. It should be noted that features are not independent of each other, and therefore the algorithm may detect different features during each iteration that collectively define a similar patient phenotype.

### Statistical analysis

Categorical variables are summarized as counts (percentages) and continuous variables as mean ± standard deviation or median with IQR (Q1 to Q3), unless specified otherwise. Categorical groupings (i.e., study arm assignment) across adaptive runs and the original trial were compared using the chi-square test. Survival analyses were performed by fitting a Cox regression model for the time-to-primary outcome using the treatment arm as an independent predictor. When estimating individualized treatment effects, each observation was weighted based on the calculated similarity metric to the index patient of each analysis. Between-subgroup analyses for heterogeneity of treatment effect were performed by computing a *p* value for interaction. Simulation-level counts and point estimates were compared to the respective numbers/counts from the original trial using one-sample *t*-tests; for the counts of remaining study participants, alpha was set at 0.025 given the one-sided nature of the test; otherwise, alpha was set at 0.05. We graphically summarized the counts of enrolled participants, primary outcome events, Cox regression-derived effect estimates (unadjusted, for each one of the primary and secondary outcomes), and *p* values at each of the interim timepoints (with error bars denoting the standard error of the mean), in addition to the final analysis timepoint. Cumulative incidence curves for the primary outcome, stratified by the enrollment period and simulation analysis, were graphically presented. Each one was compared to the original trial subset for the same period using the log-rank statistic. An exploratory hierarchical outcome analysis was assessed using the win ratio for prioritized outcomes and corresponding 95% confidence intervals^{47}.
As discussed above, for the primary outcome, we simulated a superiority trial design, assuming a power of 80% and type I error of 0.025. Statistical tests were two-sided with a level of significance of 0.05, unless specified otherwise. Analyses were performed using Python (version 3.9) and R (version 4.2.3), and reported according to STROBE guidelines^{48}. Packages used for analysis are specified in the Supplementary Methods.