Exercise 14. Non-collapsibility of proportional hazards models
We simulate for time-to-event data assuming constant hazards and then investigate whether we can estimate the underlying parameters. Note that the binary variable \(X\) is essentially a coin toss and we have used a large variance for the normally distributed \(U\).
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest")
for each package you require.
We can use dagitty
to graph the causal diagram:
set.seed(12345)
d <- with(list(n=1e4), {
x <- rbinom(n, 1, 0.5)
u <- rnorm(n, 0, 3)
t <- rexp(n, exp(-5+x+u))
c <- runif(n, 0, 10)
y <- pmin(t, c)
delta <- (t < c)
data.frame(y,x,u,delta)
})
head(d)
## y x u delta
## 1 1.809348 1 -2.0774167 FALSE
## 2 1.251765 1 3.3740868 TRUE
## 3 2.078184 1 0.5236054 FALSE
## 4 3.528797 1 -5.9034201 FALSE
## 5 6.659236 0 -4.9328152 FALSE
## 6 5.366579 0 -0.3293542 FALSE
(a) Fitting models with both \(X\) and \(U\)
For constant hazards, we can fit (i) Poisson regression, (ii) Cox regression and (iii) flexible parametric survival models. With both covariates, these models are expected to estimate the parameters for \(X\) and \(U\), with values close to 1:
##
## Call:
## glm(formula = delta ~ x + u + offset(log(y)), family = poisson,
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9137 -0.3774 -0.1355 -0.0274 3.9389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.97595 0.05344 -93.11 <2e-16 ***
## x 0.96027 0.04372 21.96 <2e-16 ***
## u 0.99087 0.01091 90.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 14687.0 on 9999 degrees of freedom
## Residual deviance: 5811.2 on 9997 degrees of freedom
## AIC: 10257
##
## Number of Fisher Scoring iterations: 7
## Call:
## coxph(formula = Surv(y, delta) ~ x + u, data = d)
##
## n= 10000, number of events= 2220
##
## coef exp(coef) se(coef) z Pr(>|z|)
## x 0.96227 2.61764 0.04490 21.43 <2e-16 ***
## u 0.99034 2.69215 0.01461 67.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## x 2.618 0.3820 2.397 2.858
## u 2.692 0.3714 2.616 2.770
##
## Concordance= 0.922 (se = 0.002 )
## Rsquare= 0.525 (max possible= 0.98 )
## Likelihood ratio test= 7448 on 2 df, p=<2e-16
## Wald test = 4629 on 2 df, p=<2e-16
## Score (logrank) test = 5611 on 2 df, p=<2e-16
## Maximum likelihood estimation
##
## Call:
## stpm2(formula = Surv(y, delta) ~ x + u, data = d, df = 4)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## (Intercept) -14.565979 0.516079 -28.224 < 2.2e-16 ***
## x 0.960750 0.044878 21.408 < 2.2e-16 ***
## u 0.990848 0.014600 67.865 < 2.2e-16 ***
## nsx(log(y), df = 4)1 9.424531 0.494145 19.072 < 2.2e-16 ***
## nsx(log(y), df = 4)2 8.786272 0.328904 26.714 < 2.2e-16 ***
## nsx(log(y), df = 4)3 16.534497 1.013387 16.316 < 2.2e-16 ***
## nsx(log(y), df = 4)4 8.562140 0.206188 41.526 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 8520.997
## summary table for the coefficients for X
rbind(Poisson=coef(summary(fit1))["x",c("Estimate","Std. Error")],
Cox=coef(summary(fit2))["x",c("coef","se(coef)")],
stpm2=coef(summary(fit3))["x",c("Estimate","Std. Error")]
)
## Estimate Std. Error
## Poisson 0.9602659 0.04372345
## Cox 0.9622728 0.04490412
## stpm2 0.9607498 0.04487831
If we fit a time-varying hazard ratio for \(X\), we see that the hazard ratio looks reasonably constant.
fit <- stpm2(Surv(y,delta)~x+u,data=d,df=4, tvc=list(x=2))
plot(fit, type="hr", newdata=data.frame(x=0,u=0), var="x", ylim=c(1,4))
(b) Fitting models with only \(X\)
We now model by excluding the variable \(U\). This variable could be excluded when it is not measured or perhaps when the variable is not considered to be a confounding variable – from the causal diagram, the two variables \(X\) and \(U\) are not correlated and are only connected through the time variable \(T\).
##
## Call:
## glm(formula = delta ~ x + offset(log(y)), family = poisson, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1690 -0.8129 -0.5805 -0.1687 4.8126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.19636 0.03377 -94.66 <2e-16 ***
## x 0.51289 0.04341 11.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 14687 on 9999 degrees of freedom
## Residual deviance: 14544 on 9998 degrees of freedom
## AIC: 18988
##
## Number of Fisher Scoring iterations: 7
## Call:
## coxph(formula = Surv(y, delta) ~ x, data = d)
##
## n= 10000, number of events= 2220
##
## coef exp(coef) se(coef) z Pr(>|z|)
## x 0.48511 1.62436 0.04342 11.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## x 1.624 0.6156 1.492 1.769
##
## Concordance= 0.565 (se = 0.005 )
## Rsquare= 0.013 (max possible= 0.98 )
## Likelihood ratio test= 127.9 on 1 df, p=<2e-16
## Wald test = 124.8 on 1 df, p=<2e-16
## Score (logrank) test = 127.3 on 1 df, p=<2e-16
## Maximum likelihood estimation
##
## Call:
## stpm2(formula = Surv(y, delta) ~ x, data = d, df = 4)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## (Intercept) -9.883577 0.473287 -20.883 < 2.2e-16 ***
## x 0.485236 0.043423 11.175 < 2.2e-16 ***
## nsx(log(y), df = 4)1 7.458585 0.467805 15.944 < 2.2e-16 ***
## nsx(log(y), df = 4)2 6.323162 0.310307 20.377 < 2.2e-16 ***
## nsx(log(y), df = 4)3 13.212493 0.930871 14.194 < 2.2e-16 ***
## nsx(log(y), df = 4)4 5.541993 0.192043 28.858 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 15844.98
## summary table for the coefficients for X
rbind(Poisson=coef(summary(fit1))["x",c("Estimate","Std. Error")],
Cox=coef(summary(fit2))["x",c("coef","se(coef)")],
stpm2=coef(summary(fit3))["x",c("Estimate","Std. Error")]
)
## Estimate Std. Error
## Poisson 0.5128869 0.04341441
## Cox 0.4851147 0.04342341
## stpm2 0.4852363 0.04342328
We clearly see that the estimate for \(X\) is different in the models without \(U\). If we now allow the the hazard ratio to vary by time:
fit <- stpm2(Surv(y,delta)~x,data=d,df=4, tvc=list(x=2))
plot(fit, type="hr", newdata=data.frame(x=0), var="x", ylim=c(1,4))
We see that the hazard ratio is clearly time-varying, starting at an initial value at time = 0 of close to exp(1), but then declining rapidly. As discussed in class, this is an example of unmeasured heterogeneity, which is a normal random effect or, in this context, a log-normal frailty. Initially, there is no selection in the at-risk population, and the estimated marginal hazard ratio from the model without \(U\) is similar to the conditional hazrd ratio from the model that adjusts for \(U\). However, at later times, the at-risk population has been selected for those with a smaller frailty (because they were less likely to have had the event), and for a log-normal frailty the marginal hazard ratio is attenuated towards 1.
Let us stress that \(U\) is not a confounder. The issue is that fitting proportional hazards models with unmodelled heterogeneity for variables that are not confounders can lead to time-varying hazard ratios – and marginal estimates that do not have a causal interpretation.
(c) Rare vs common outcomes
set.seed(12345)
d <- with(list(n=1e4), {
x <- rbinom(n, 1, 0.5)
u <- rnorm(n, 0, 3)
t <- rexp(n, exp(-5+x+u))
c <- runif(n, 0, 1) # CHANGED FROM 10 TO 1
y <- pmin(t, c)
delta <- (t < c)
data.frame(y,x,u,delta)
})
##
## Call:
## glm(formula = delta ~ x + offset(log(y)), family = poisson, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6528 -0.4527 -0.3585 -0.2266 4.5702
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.23991 0.06250 -35.839 <2e-16 ***
## x 0.69377 0.07688 9.024 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6129.1 on 9999 degrees of freedom
## Residual deviance: 6043.0 on 9998 degrees of freedom
## AIC: 7557
##
## Number of Fisher Scoring iterations: 7
## Call:
## coxph(formula = Surv(y, delta) ~ x, data = d)
##
## n= 10000, number of events= 755
##
## coef exp(coef) se(coef) z Pr(>|z|)
## x 0.68466 1.98309 0.07688 8.905 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## x 1.983 0.5043 1.706 2.306
##
## Concordance= 0.583 (se = 0.009 )
## Rsquare= 0.008 (max possible= 0.737 )
## Likelihood ratio test= 83.85 on 1 df, p=<2e-16
## Wald test = 79.3 on 1 df, p=<2e-16
## Score (logrank) test = 82.45 on 1 df, p=<2e-16
## Maximum likelihood estimation
##
## Call:
## stpm2(formula = Surv(y, delta) ~ x, data = d, df = 4)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## (Intercept) -10.096088 0.649528 -15.5437 < 2.2e-16 ***
## x 0.684636 0.076881 8.9051 < 2.2e-16 ***
## nsx(log(y), df = 4)1 6.453301 0.640095 10.0818 < 2.2e-16 ***
## nsx(log(y), df = 4)2 5.610039 0.429274 13.0687 < 2.2e-16 ***
## nsx(log(y), df = 4)3 11.416226 1.274732 8.9558 < 2.2e-16 ***
## nsx(log(y), df = 4)4 5.043582 0.268584 18.7784 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 3975.012
## summary table for the coefficients for X
rbind(Poisson=coef(summary(fit1))["x",c("Estimate","Std. Error")],
Cox=coef(summary(fit2))["x",c("coef","se(coef)")],
stpm2=coef(summary(fit3))["x",c("Estimate","Std. Error")]
)
## Estimate Std. Error
## Poisson 0.6937665 0.07687819
## Cox 0.6846584 0.07688225
## stpm2 0.6846362 0.07688112