There are three main algorithms within the package.

models for the mean and the variance
models for the mean and the variance, taking into account censoring of the outcome variable
models for the location, scale and shape

Here we will discuss some of the specifics of each of these algorithms.

Mean and Variance models

The statistical methods for the mean and variance model given in Robledo and Marschner (2021). Briefly, let a set of covariates be included on which the mean depends be P, giving z_i = (z_i1, …, z_iP)^T and the number of covariates on which the variance depends be Q giving x_i = (x_i1, …, x_iQ)^T. This leads to the general model

where β = (β₀, ..., β_P)^T and α = (α₀, ..., α_Q)^T. The mean is estimated using linear regression, weighted by the inverse of the current estimate of the variance. The covariates (z_i) are fit in this model, and updated estimates of β̂_i are obtained, where $\hat{\boldsymbol{\beta}}=(\hat{\beta}_0, .., \hat{\beta}_P)$. This additional step to fit the mean model in both the E-step and M-step converts the EM algorithm to a ECME (Expectation Conditional Maximization Either) algorithm (Liu and Rubin 1994; McLachlan and Krishnan 2007), and is given in Figure @ref(fig:ecmemean) below.

By using the idea that the additive variance model is considered to be a latent outcome model in which the observed outcome is the sum of independent outcomes, we allow each member of the covariate vector x_i to be additional latent variables to be estimated. Therefore, with a total of Q variables in the variance model, there are Q + 1 independent, unobserved, latent variables, where X_i = Y_i + Z_i1 + ... + Z_iQ.

In order to search the entire parameter space, a process of using sets of constrained EM algorithms are performed, an instance of a combinatorial EM (CEM) algorithm. Without loss of generality, assume that each covariate is scaled such that x_i ∈ [0, 1]. Therefore, the constant term in the variance model (α₀) will be the variance when all other variance parameters are zero. In order to search for non-positive slope, an EM algorithm is fit using the covariate 1 − x_i in place of x_i, and thus the EM algorithm is maximising the log-likelihood over the parameter space estimates for α₀ ≥ 0 and α_q < 0 for q = 1, 2, ..., Q. By repeating this for all possible covariate combinations, we have a total of 2^Q EM algorithms in order to search the entire parameter space. The MLE will then be the $\hat{\boldsymbol{\theta}}$ from the EM algorithm that achieved the highest log-likelihood, where θ is a vector of unknown parameters, θ = (β, α). These 2^Q models are referred to as the family of complete data models, and the general algorithm is again an instance of a CEM algorithm. Standard errors are obtained from the Fisher information matrix (Robledo and Marschner 2021), or by bootstrapping.

The ECME algorithm for the estimation of the mean and variance

Mean and Variance models with censored outcome data

Censoring often occurs in the analysis of biomarker data, where the samples are measured with an assay that usually have detectable limits, with data at both upper and lower limits reached. When biomarker data can be below the detectable limit (left censored), and/or above the detectable limit (right censored), differences between two biomarker readings will also be censored. We need to be able to accurately model such data, particularly with a steady increase in the numbers of biomarkers investigated in clinical trials, and potentially large amounts of data above or below detectable limits. Some studies report over one-third of their biomarker data as below the detectable limit (White et al. 2014).

If X_i^* is our true outcome data, let X_i be the outcome data observed with censoring, and let us account for lower and upper limit censoring for completeness. This censored data given an additional level of missingness in our CEM algorithm, and thus an extra level of our complete data. If X^(L) is the lower limit of detection and X^(U) is the upper limit then

We will also need to specify c, the censoring indicator, where

Assuming still that the covariates x_iq are scaled such that x_iq ∈ [0, 1], let us now fit the same model as Equation @ref(eq:general) however with X_i^* as our outcome data. If Φ is the standard normal cumulative distribution function and ϕ is the standard normal probability density function, then two useful functions are $R(z)=\dfrac{\phi(z)}{1-\Phi(z)}$ and $Q(z)=\dfrac{-\phi(z)}{\Phi(z)}$ (Mills 1926). Additionally, when z tends to negative infinity, the approximation $\dfrac{\phi(z)}{\Phi(z)} \approx -z$ can be used. The likelihood function for the observed data model with censored data is where The corresponding log-likelihood is then where Now in the CEM algorithm, the uncensored outcome variable X_i^* will be assumed to be composed of Q + 1 independent, unobserved, latent variables: This means that we will need to find the conditional expectations of Y_i² and Z_i1², ..., Z_iQ², given the observed outcome value, X_i, which may be censored.

From (Aitkin 1964), we see that for bivariate normal variables M and N, where the conditional distribution for M given censoring in N can be obtained. In particular, given an upper limit, a, it was shown that: where ρ is the correlation between M and N and R(a) was defined previously. We know from Equation @ref(eq:clatents) that the variance for Y_i is α₀, and the variance for X_i is σ²(α). So, given Equation @ref(eq:clatents) and the general result in Equation @ref(eq:fcensor), it follows that, We can then determine the correlation between Y_i and X_i: We can also determine the correlation for each of the Z_iq and X_i, using a similar argument: Now we apply the correlation found in Equation @ref(eq:rho1) to Equation @ref(eq:fcens2), in order to obtain our conditional expectation of Y_i²: The expectation of each of the Z_iq² follows the same argument, using Equation @ref(eq:rho2).

In order to apply these expectations to left censored data, we use the function Q(z) rather than R(z). For censored outcome data, the E-step involves the calculation of the conditional expectations as defined above:

remembering that θ = (β, α), and X_i is defined in Equation @ref(eq:xi). The conditional expectations associated with the Z_iq follow the same principles,

The next step of the algorithm involves calculating the updated estimates of θ, θ̂^new. The estimates for $\hat{\boldsymbol{\alpha}}$ for fixed β are obtained by the following,

Previously in the mean and variance model (Robledo and Marschner 2021), a weighted linear regression was fit in order to obtain an updated estimate of the mean parameters ($\hat{\boldsymbol{\beta}}$) at each iteration, for fixed α = α̂^new. In order to estimate the mean model with censored outcome data, we need to fit a heteroscedastic censored linear regression model. This can be achieved by first standardizing the data, and then performing a homoscedastic censored linear regression at each iteration. In order to standardize the data, we divide by the standard deviation to obtain

A homoscedastic censored linear regression for $\dfrac{X_i}{\sigma(\boldsymbol{\alpha})}$ is then performed, against covariates $\dfrac{z_{ip}}{\sigma(\boldsymbol{\alpha})}$, for fixed α = α̂^new. This can easily be implemented in standard software for censored normal linear regression, which can be performed with survreg in R. Once the censored regression is performed, the β̂ estimates are back transformed by multiplying by the standard deviation σ(α) for our current fixed α. This process is continued until convergence of the parameter estimates.

This algorithm is an instance of the ECME algorithm, and is summarised schematically in Figure xx. As detailed in Section xxx and xxx, the algorithm maximises the log-likelihood over a restricted parameter space, and will need to be run multiple times in order to maximise over the full parameter space. Thus a total of 2^Q ECME algorithms must be run, once for each combination of the q^th variance covariate taking the value x_i or 1 − x_i, for q = 1, 2, ..., Q. The log-likelihood is then maximised over the entire parameter space with this family of ECME algorithms.

The ECME algorithm for the estimation of the mean and variance with censored outcome data

Standard errors are implemented by bootstrapping. In practice, this may take some time for datasets with many parameters fit in the variance model. This is due to the number of parameter spaces, and thus ECME algorithms, that must be run.

Models for the location, scale and shape

Skew-normal distribution

The skew-normal distribution is a distribution that extends the normal distribution to allow for non-zero skew (Azzalini 2013). This distribution has three parameters, the location parameter ξ (ξ ∈ (−∞, ∞)), the scale parameter ω (ω ∈ (0, ∞)) and the shape parameter ν (ν ∈ (−∞, ∞)). If ν < 0, the distribution is left skewed, and if ν > 0 then the distribution is right skewed. The normal distribution is recovered with ν = 0.

The probability density function of the skew normal is

where ϕ and Φ are the density and distribution functions of the standard normal distribution, respectively. If a random variable X has a skew-normal distribution with parameters (ξ, ω, ν), this is written as

Maximum likelihood estimation

If X₁, ..., X_n are independent and identically distributed observations from SN(ξ, ω², ν), the likelihood function for the sample is given as and the corresponding log-likelihood (omitting the constant term) reduces to

As will be explained below, the likelihood in Equation @ref(eq:llsn) is a censored Gaussian likelihood when viewed as a function of ξ and ω, and a probit regression likelihood when viewed as a function of ν. This is useful as it allows a straightforward cyclic coordinate ascent algorithm to be used to obtain the MLE. For optimisation of a multi-variable function, a cyclic coordinate ascent algorithm involves the optimisation of a function with respect to one variable holding the other variables constant, and then repeating with respect to each of the variables (Lange 2013). In our context, the cyclic coordinate ascent algorithm will cycle between a censored Gaussian model (keeping ν constant), and a probit regression (keeping ξ and ω constant).

Although there already exists an algorithm for obtaining the MLE in the skew-normal model in the SN package (Azzalini 2016) in R, we introduce this new method as it more easily generalises to regression modelling.

In order to see that @ref(eq:llsn) is equivalent to a probit regression likelihood when viewed as a function of ν alone, we note that @ref(eq:llsn) is proportional to the following function of ν, Now let the residuals be defined as follows, for fixed ξ and ω,

Then, @ref(eq:l2) is This is exactly a probit regression likelihood with the binary outcome and the linear predictor being ν|r_i|. Thus, for fixed ξ and ω, @ref(eq:llsn) can be maximised as a function of ν by fitting a probit regression on the binary outcome V_i, with no intercept and a single covariate |r_i|.

Next, let us look at the estimation of the ξ and ω parameters, holding ν constant. If we examine @ref(eq:llsn), we will now show how this can be considered to be a censored Gaussian likelihood, where the first component is with regards to the uncensored data, and the second component is with regards to the censored data (and multiplied by our constant ν). Note as well, that each observation contributes to both the censored and uncensored components of the likelihood.

For the purpose of performing a censored regression, we will utilise the algorithm developed above for censored outcome data. In order to make use of this algorithm, the data needs to be manipulated to ensure that each observation is contributing to both components of the likelihood. Firstly, we create a censoring indicator, c, of length 2n. The first n elements are set to 0, and then elements n + 1, ..., 2n are set to −1 to indicate left censoring. Next, we create a vector of 1s called I of length 2n, and we also create a new outcome vector called D, where observations 1, ..., n are X, and observations n + 1, ..., 2n are also X. This corresponds to $$ D= \left( \begin{array}{c} X_1 \\ \vdots \\ X_n \\ X_1 \\ \vdots \\ X_n \end{array} \right), ~~~ I= \left( \begin{array}{c} 1 \\ \vdots \\ 1 \\ 1 \\ \vdots \\ 1 \end{array} \right), ~~ \text{and} ~~ c= \left( \begin{array}{c} 0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 1 \end{array} \right). $$ Now, the likelihood for the observed data model is where which equates to Now we can compare Equations @ref(eq:llsn) and @ref(eq:ll2), and see that they follow a similar form, where D_i = X_i and I_i = ξ.

So, we can fit our censored regression model to update ξ̂ and ω̂ with our new outcome data D_i, using the censoring indicator (c_i) to indicate the censored observations. Note that the censored data observations require I_i and D_i to be multiplied by the constant ν. The location model contains only the vector I_i as a covariate, that is, with no intercept term. Meanwhile, the scale model contains an intercept only. The parameter from the I_i covariate is the estimate for ξ̂ and the intercept estimate from the scale model is ω̂.

Once we have calculated our current estimates for ξ̂, ω̂ and ν̂, we then examine convergence previously (Robledo and Marschner 2021). If convergence has not been met, we iterate our cyclic coordinate ascent algorithm by re-calculating the residuals, performing another probit regression and then performing our censored Gaussian regression, until we do reach our convergence criteria. It is also of importance to note that whether we use the censored normal algorithm above, or another censored normal algorithm such as that provided in the SN package, we obtain the same result.

Extension to LSS regression model

Now that an algorithm has been developed to estimate the ξ, ω and ν parameters in a skew normal model, let us extend this to incorporate a regression model for each of the three parameters. In this case the model becomes where we have covariates s_ip (p = 1, ..., P) for the location model, covariates l_iq (q = 1, ..., Q) for the scale model and covariates u_ik (k = 1, ..., K) for the shape model.

The likelihood for this model is and the log-likelihood reduces to where ξ = (ξ₁, ξ₂, ..., ξ_P), ω = (ω₁, ω₂, ..., ω_Q) and ν = (ν₁, ν₂, ..., ν_K).

Similarly to the approach taken in the previous section, the likelihood in @ref(eq:llext) can be seen as a censored Gaussian likelihood when viewed as a function of ξ and ω, and a probit regression likelihood when viewed as a function of ν. Therefore, we can take the same approach as before, with an extension of the cyclic coordinate ascent algorithm.

The steps outlined in the section above may be generalised as follows. First, the residuals can be defined as follows, for fixed ξ and ω, Now, As in section 3.2 above, this is exactly a probit regression likelihood with the binary outcome (V_i). Therefore for fixed ξ and ω, the likelihood @ref(eq:llext) can be maximised as a function of ν by fitting a probit regression on the binary outcome V_i with the covariate |r_i|, and each covariate of interest u_ik multiplied by |r_i|, with no intercept term. The estimate of ν₀ is obtained from the estimate from the |r_i| parameter in the model, and each of the estimates of ν_k are obtained from the parameter for the relevant u_ik covariate in the model.

Next, let us consider the estimation of ξ and ω, while maintaining ν constant. Firstly, similar to Section 3.2, we must create our censoring indicator c and our new outcome vector D. Also, we must create a new variable for each of the covariates s_ip which is of length 2n. For each of these covariates, the values i = 1, ..., n are the values from s_ip i = 1, ..., n, and the values i = n + 1, ..., 2n are also s_ip. Let us call these new covariates of length 2n, z_ip.

In the same manner, the scale covariates are each manipulated to be of length 2n. Each covariate, l_iq, is duplicated so that the first set of values, i = 1, ..n, are the respective value from l_iq. Then the next values, i = n + 1, .., 2n, are also the respective value from l_iq. Let us call these new covariates x_iq.

Section 2 describes the extension for the censored regression model for μ and σ², so our observed data likelihood for this extended model is where which equates to

Once the current estimates for ξ, ω² and ν are obtained, the cyclic coordinate ascent algorithm continues to cycle between a probit regression and a censored normal regression until convergence. The algorithm is summarised schematically in Figure @ref(fig:cyclicascent). In this algorithm, estimation of standard errors is obtained by bootstrapping only.

The cyclic coordinate ascent algorithm for the estimation of the location, scale and shape

References

Aitkin, M. A. 1964. “Correlation in a Singly Truncated Bivariate Normal Distribution.” Psychometrika 29 (3): 263–70. https://doi.org/10.1007/BF02289723.

Azzalini, Adelchi. 2013. The Skew-Normal and Related Families. Cambridge University Press.

———. 2016. The R Package sn: The Skew-Normal and Skew-t Distributions (Version 1.4-0). Università di Padova, Italia. http://azzalini.stat.unipd.it/SN.

Lange, Kenneth. 2013. Optimization. Springer.

Liu, Chuanhai, and Donald B. Rubin. 1994. “The ECME Algorithm: A Simple Extension of EM and ECM with Faster Monotone Convergence.” Biometrika 81 (4): 633–48. https://doi.org/10.2307/2337067.

McLachlan, Geoffrey J., and Thriyambakam Krishnan. 2007. The EM Algorithm and Extensions. New York: Wiley. https://doi.org/10.1002/9780470191613.

Mills, John P. 1926. “Table of the Ratio: Area to Bounding Ordinate, for Any Portion of Normal Curve.” Biometrika 18 (3/4): 395–400. https://doi.org/10.2307/2331957.

Robledo, K. P., and I. C. Marschner. 2021. “A New Algorithm for Fitting Semi-Parametric Variance Regression Models.” Computational Statistics. https://doi.org/10.1007/s00180-021-01067-6.

White, Harvey D., Andrew Tonkin, John Simes, Ralph Stewart, Kristy Mann, Peter Thompson, David Colquhoun, et al. 2014. “Association of Contemporary Sensitive Troponin i Levels at Baseline and Change at 1 Year with Long-Term Coronary Events Following Myocardial Infarction or Unstable Angina. Results from the LIPID Study (Long-Term Intervention with Pravastatin in Ischaemic Disease).” Journal of the American College of Cardiology 63 (4): 345–54. https://doi.org/10.1016/j.jacc.2013.08.1643.