This repository is for the data and code presented in the paper published here (to be added).
Firstly, you will need to install the VarReg
package and
a few others to perform the code below.
This dataset is located within the gamlss
package. CD4
is a type of white blood cell, and in this dataset, it has been measured
in uninfected children born from HIV-1 infected women. The dataset
contains 609 measurements of CD4 cell counts and the child’s age at
which the measurements were taken.
data("CD4")
str(CD4)
#> 'data.frame': 609 obs. of 2 variables:
#> $ cd4: num 387 2183 904 1681 656 ...
#> $ age: num 4.5 0.83 2.06 1.44 2.67 1.17 1.94 1.72 2.54 1.66 ...
Lets reproduce the graphic in the paper, showing that at younger ages there is more variation in the CD4 counts than at older ages, demonstrating heteroscedasticity.
ggplot(data=CD4, aes(y=cd4, x=age)) +
geom_point()+
xlab("Age of child in years")+
ylab("CD4 cell count")+
theme_minimal()
Let us fit a linear model in the mean and variance model, like so
cd4.linear<-semiVarReg(y = CD4$cd4,
x=CD4$age,
meanmodel = "linear",
varmodel = "linear",
maxit=10000)
key outputs for model fit:
Key estimates from model:
plotVarReg
function:Searching to a max of 7 knots in the mean and variance, with maximum
iterations (maxit
) of 100 to minimise time taken for the
search:
cd4.best <- searchVarReg(y=CD4$cd4,
x=CD4$age,
maxknots.m = 7,
maxknots.v = 7,
selection="AIC",
maxit=100)
lets look at the AIC table to identify where the best model is located:
Mean_zero | Mean_constant | Mean_linear | Mean_Knot0 | Mean_Knot1 | Mean_Knot2 | Mean_Knot3 | Mean_Knot4 | Mean_Knot5 | Mean_Knot6 | Mean_Knot7 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Var_constant | 9750.8 | 9205.7 | 9044.1 | 8995.6 | 8997.3 | 8977.0 | 8972.1 | 8967.6 | 8966.3 | 8964.8 | 8964.6 |
Var_linear | 9662.4 | 9157.1 | 8999.8 | 8919.5 | 8916.2 | 8893.4 | 8889.8 | 8886.4 | 8885.6 | 8884.5 | 8884.7 |
Var_Knot0 | 9554.9 | 9157.1 | 8947.8 | 8839.2 | 8834.9 | 8862.5 | 8890.6 | 8905.7 | 8912.8 | 8916.5 | 8920.8 |
Var_Knot1 | 9569.4 | 9071.5 | 8904.2 | 8843.9 | 8843.7 | 8827.6 | 8825.4 | 8823.6 | 8823.3 | 8822.8 | 8823.6 |
Var_Knot2 | 9529.3 | 8998.8 | 8897.0 | 8842.4 | 8843.9 | 8810.5 | 8804.0 | 8799.1 | 8797.8 | 8797.3 | 8798.5 |
Var_Knot3 | 9524.5 | 8989.1 | 8892.3 | 8845.5 | 8847.5 | 8809.0 | 8802.2 | 8797.2 | 8796.2 | 8795.6 | 8796.8 |
Var_Knot4 | 9526.0 | 8989.6 | 8893.1 | 8848.0 | 8849.8 | 8813.2 | 8805.8 | 8801.3 | 8800.6 | 8800.2 | 8801.3 |
Var_Knot5 | 9525.9 | 8985.4 | 8894.0 | 8849.5 | 8851.4 | 8809.1 | 8803.7 | 8800.0 | 8799.1 | 8798.9 | 8800.1 |
Var_Knot6 | 9527.4 | 8987.0 | 8895.1 | 8851.7 | 8853.6 | 8810.9 | 8804.9 | 8801.3 | 8800.6 | 8800.4 | 8801.4 |
Var_Knot7 | 9530.4 | 8988.6 | 8896.6 | 8852.4 | 8854.2 | 8813.2 | 8807.9 | 8804.3 | 8803.4 | 8803.2 | 8804.4 |
The best model is saved within the cd4.best
list. The
key estimates from the best model are:
cd4.best$best.model$knots.m
#> [1] 6
cd4.best$best.model$knots.v
#> [1] 3
cd4.best$best.model$mean
#> Intercept M_Knt6_Base1 M_Knt6_Base2 M_Knt6_Base3 M_Knt6_Base4 M_Knt6_Base5 M_Knt6_Base6 M_Knt6_Base7
#> 32.98833 1597.06078 713.32197 582.58478 472.08281 291.50092 269.24506 220.66235
#> M_Knt6_Base8
#> 155.83254
cd4.best$best.model$variance
#> Intercept V_Knt3_Base1 V_Knt3_Base2 V_Knt3_Base3 V_Knt3_Base4 V_Knt3_Base5
#> 40801.4581 411830.6370 109290.5901 2784.9777 914.4245 -25616.3670
ie. 6 knots in mean and 3 in variance, with parameter estimates as given. We can then plot this best model:
From these residuals, there still appears to be deviations from normality in the residuals, again seen in both the Q-Q plot and the histogram. Therefore the incorporation of a shape parameter should be investigated.
lssVarReg
functionIn order to fit models with a shape parameter, we utilise the
lssVarReg()
function. Firstly, let us fit a constant shape
parameter with 6 knots in the mean and 3 knots in the variance, that
is,
Let us fit two models, one with a constant shape parameter and one with a linear model in the shape, with the following code.
con.shape<-lssVarReg(y=CD4$cd4,
x=CD4$age,
locationmodel="semi",
knots.l=6,
scale2model="semi",
knots.sc=3,
mono.scale = "inc",
shapemodel="constant",
maxit=1000 )
linear.shape<-lssVarReg(y=CD4$cd4,
x=CD4$age,
locationmodel="semi",
knots.l=6,
scale2model="semi",
knots.sc=3,
shapemodel="linear",
int.maxit=1000,
print.it=TRUE,
maxit=1000 )
If we want to speed up the model, we can use starting estimates (from our best model) and parameter space restrictions, like so for the constant model:
con.shape_faster <- lssVarReg(y=CD4$cd4, x=CD4$age,
locationmodel="semi",
knots.l=6,
scale2model="semi",
knots.sc=3,
shapemodel="constant",
para.space="positive",
location.init = cd4.best$best.model$mean,
scale2.init = cd4.best$best.model$variance,
int.maxit = 10000,
maxit=1000 )
And compare the models as we did in the paper:
data.frame(Model = c("No shape", "Constant shape", "Linear shape"),
AIC = c(cd4.best$best.model$aic, con.shape$aic, linear.shape$aic),
BIC = c(cd4.best$best.model$bic, con.shape$bic, linear.shape$bic)) %>%
kbl(digits=1,
align='c',
caption = "Comparison of AIC from shape models for CD4 cell counts.") %>%
kable_paper(full_width = TRUE )
Model | AIC | BIC |
---|---|---|
No shape | 8795.6 | 8861.8 |
Constant shape | 8697.7 | 8768.3 |
Linear shape | 8687.1 | 8762.1 |
And compare the residuals:
n<-length(CD4$age)
knot6<-bs(CD4$age, df= 8, degree=2)
knot3<-bs(CD4$age, df= 5, degree=2)
##normal model
npred.mean<-colSums(t(cbind(rep(1,n),knot6))*cd4.best$best.model$mean)
npred.var<-colSums(t(cbind(rep(1,n),knot3))*cd4.best$best.model$variance)
stand.res<- (CD4$cd4-npred.mean)**2/(npred.var)
chiq<-qchisq(c(1:n)/(n+1), df=1)
#constant residuals
con.res<-lss_calc(con.shape)
con.res<-con.res[order(con.res$CD4.age),]
#linear residuals
lin.res<-lss_calc(linear.shape)
lin.res<-lin.res[order(lin.res$CD4.age),]
par(mfrow=c(1,3))
plot(chiq, sort(stand.res), main=NULL, ylim=c(0, 10), pch=19, ylab="Squared standardised residuals after fit with no shape", xlab="Chi-Square(1) quantiles")
abline(0,1)
mtext('A', side=3, line=2, at=0, adj=3)
plot(chiq, sort(con.res$stand.res2), main=NULL, ylim=c(0, 10),pch=19, ylab="Squared standardised residuals after LSS (constant) fit", xlab="Chi-Square(1) quantiles")
abline(0,1)
mtext('B', side=3, line=2, at=0, adj=3)
plot(chiq, sort(lin.res$stand.res2), main=NULL, ylim=c(0, 10),pch=19, ylab="Squared standardised residuals after LSS (linear) fit", xlab="Chi-Square(1) quantiles")
abline(0,1)
mtext('C', side=3, line=2, at=0, adj=3)