Some Genetic Regression Models for Multiple Quantitative End Points Data

Research Article

Austin Biom and Biostat. 2016; 3(1): 1029.

# Some Genetic Regression Models for Multiple Quantitative End Points Data

Ao Yuan* and Jaeil Ahn

Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, USA

*Corresponding author: Ao Yuan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington D.C. 20057, USA

Received: May 28, 2015; Accepted: January 28, 2016; Published: February 08, 2016

## Abstract

Multiple endpoints data are common in practice. There are various statistical methods for the analysis of this type of data, however, genetic models for familial observations with multiple endpoints data are relatively few, and the existing methods are basically variations of the Elston-Stewart algorithm. Here we consider several joint statistical models for such data with quantitative measurements with a new algorithm, which is computationally more efficient than the existing method. The proposed method is detailed in some commonly used parametric, semi parametric and nonparametric settings for this type of data. For un-genotyped data, the commonly used models are the mixture and variance components models. We elaborate how these genetic models can be extended for multiple endpoints data with the proposed method

Keywords: Censoring; Endpoints data; Familial structure; Genotype; Missing observation

## Introduction

Endpoints data are observed responses from patients of some pre-specified clinical events of interests, such as death, loss of vision, occurrences of certain diseases, or other symptomatic events. In medical research, study participants are often followed for a long time, during which some participants may drop out early, so that random censorship may be present in the data. Such data have missing observations, which may be inhomogeneous across the patients. For example, in one patient we have observations on the lung cancer and kidney disease, and on another we have observations on lung cancer, diabetes and asthma. Analyses of such data largely fall into two categories: hypothesis testing (usually non-model based) and model inference. Here we concentrate on the model inference of such data.

For censored data modeling there are extensive literatures [1-8], just mention a few. For multiple endpoints data analyses, there are various statistical methods [9-13], for example. Wei and Glidden [14] provided an overview for some of the methods in this field.

Family genetic data differ from the ordinary data in that they are collected in familial units, often with varying structures and sizes, and with/without genotyping. These features make the models distinct. The key in the modeling is the familial dependence structure and the implementation of the genetic mechanism, existing methods are basically variations of the Elston-Stewart algorithm, which is a multi-level mixture model, and the computation is often challenging. Genetic models for multiple endpoints data are relatively limited. Here we consider some statistical joint models for such data with quantitative measurements with a new algorithm, which is a one-level mixture model, thus enhance the computation considerably. The parametric method is used when one has some confidence about the model specification. The semi parametric method can be used when there is not enough information about the full parametric model specification. The nonparametric method is used for the robustness of least model assumptions. We elaborate our methods for the parametric, semi parametric and nonparametric cases. The methods we describe below are valid for arbitrary pedigrees; however, in this article we focus on the simpler case of nuclear family for illustration.

In genetic analysis, the data contains genotypes, partially genotypes, or no genotypes. However, even if the data are genotyped, it is still of interest to know whether there are some other unknown gene(s) behind the response functioning. There are reports that with added unknown gene locus, the likelihood Akaike information reduced (e.g., [15], p.1091, [16], which makes sense, as correct parameter(s) added to the model will reduce its AIC), or the segregation analysis guided to some other gene(s) which deserve(s) further investigation. So even for the genotyped data, a segregation model is still of importance. It is also the general model including the genotyped data case. In the following we derive some commonly used regressive models for this case, including the parametric model, semiparametric proportional hazards model, nonparametric least squares model, variance components model and the competing risks model. Also, hypothesis testing on parameters of interest can be conducted using the likelihood ratio statistics based on the parametric models. Our aim here is to present several new parametric, semi parametric and nonparametric models for this familial data, and thus we focus derivations of the basic forms of these models. Implementations of these models and applications to real data will follow in our future work.

## Methods

Suppose there are d responses observed with some clinical events of interests, along with r covariates, for each member in a family. We concentrate on nuclear family structure for simplicity. In practice we only observe a subset of the responses and covariates for each patient in a family. Let yi=(yif,yim,yis)(i=1,…,n) be the vector of responses for the i-th nuclear family, with corresponding covariates xi=(xif,xim,xis) (i=1,…,n). Here yif a dif(<d) dimensional vector of responses of the father in the i-th family, which belongs to a dif dimensional subspace of the d-dimensional space, with a response non-missing indicator vector Iif and covariate non-missing indicator vector Jif. Similarly, yim denotes a vector of responses of the mother in the i-th family and yis=(yi1,…,yibi) is for offspring with each of the yij s has the same data structure as that for yif. For example, there are three responses to be observed, we have the first and third on the father, then Iif=(1,0,1), and dif=|Iif|=2 is its dimension or cardinality. If there are total of five covariates in the design, and we may only have the first, second, fourth covariates for the father, then Jif=(1,1,0,1,0), and rif=|Jif|=3 is its dimension. We assume random censorship. Let δif=(δif1,…, δifdif) be the censoring indicator of yif, i.e. δifj=1 if yifj is uncensored, and δifj=0 otherwise. Similar notations are used for the mother. For the off springs, yis = (yi1,...,yki) denote the response vector, with yij be dij dimensional observation for the jth sib, with response configuration Iij and covariate configuration jif, (j=1,…,ki). Let di=(dif,dim,dis), Ii=(Iif,Iim,Iis), Ji=(Jif,Jim,Jis), δi=(δifimis) and δi=(Cif,Cim,Cis). The complete data information consists of Zi=(yi,xi,Ii,Jii), (i=1,…,n). Let χ(⋅) be the indicator function, i.e. χ(gir=s)=1 if the genotype of the rth individual is s and zero otherwise. Let πs be the population proportion of the S-th genotype, t(s|i,k) be the transmission probability of a off string’s genotype s given the parents’ genotype (i,k), and θ be the collection of all the parameters, including the as and βs in the mean and the parameters in the within and between individual covariance matrices and the genotype frequencies πks and transmission probabilities t(s|i,k)s. With unobserved genotypes, the computation is a serious challenge because of the mixture nature of the model. Let f(yr|θ), F(yr|θ), and S(yr|θ)=1-F(yr|θ) be the density function, distribution function and survival function of Yr, respectively. They may be described by its genotype gr through the mean function specification. To simplify model specification, we assume random mating so that father and mother can be viewed as independent in most cases. The case of non-random mating or within parent’s dependence can be treated similarly with more involved notations.

Note that there is within family dependence but independence among different families. We assume the genotypes of each patient are unobserved; the case of observed genotypes is automatically covered and simpler. Now we describe the methods in some common settings below.

## Parametric model

Let the genotypes at the locus of interest be coded as 1,…,k. We first consider the case of no missing record and censoring. The regressive model assumes yir=μ(gir)+∈ir, (i,=1,…,n; r=f,m,1,…,n),

where μ(gir)=μ0+αχ(gir)+βxir is the mean phenotypic value, α=(α1,…,αk)', χ(gir)=(χ(gir=1),…,χ(gir=k), μ0 is the intercept vector,. The residual error term ir is a d dimensional random vector where they are independent across i but dependent across r.

In the case of multivariate familial quantitative response data, under the commonly used Elston-Stewart [15] algorithm or its variants, the likelihood of the observation yi for the i-th nuclear family is

where typically the density is assumed multivariate normal with covariate matrix Σ, f(⋅|θ,k) is the density for residual ∈i with genotype gi in the mean vector specification, f(yl|θ,i,j,r) is the density for residual r with genotype gr and with adjusted mean and variance given by

μ(gil=r)-ΩΣ-1[(yif-μ(gf=k))+(ym-μ(gm=j))] and Σ-ΩΣ-1Ω

where Σ=Cov(Yl,Yl) and Ω= Cov(Yl,Ym); K(θ,I,j) is a quantity that depends on the parents’ genotypes and the mean [18]. It is well known that when the number of genotypes is relatively large, this model is computationally inefficient [19]. Proposed a computational more efficient model. In light of [19], let Gf=Gm=(π1,…,πI), then the joint likelihood for the i-th family is written as

where

In the mean specification f(yl|G,θ,r) is the density of residual ∈l with genotype gl=r with adjusted mean and variance given by

where yp=(yf,ym),μp=(μfm), similarly for ; μm

In comparison, model (1) has three layers of mixing (summation) corresponding to bik3 function evaluations that grow exponentially with the number of genotypes. On the other hand, model (2) has only one layer of mixing in three factors each, or (bi+2)k function evaluations that are linearly proportional the number of genotypes. The reduction of computation will be more significant for multiple loci case.

Here we extend this model in the case of censoring and partial observation. In this case, the mean is modeled as

where the operation means the projection of μ0 onto the subspace corresponding to the nonzero elements of Iir, similarly for and . The corresponding error is now .

Recall that in the case of 1-dimensional observation without genetic implementation, the likelihood for an observation yi with a censoring indicator di is

To extend this to our situation, for any dimension indicator Ii, and any d-variable function v(⋅), let be the marginal version of v(⋅) with respect to the non-zero entry of Ii and Let 1-di be the indicator with the same length of di but with 0 and 1 reversed. The full likelihood is

Here extra caution should be taken since the observation vector from each individual may vary in dimensions and sub-spaces. For a d-dimensional vector v, let be its margin with respect to ; and for a d-dimensional matrix A, denote the submatrix by the rows corresponding to the non-zero entry of dijIij and columns corresponding to the non-zero entry of dilIil. In particular, has adjusted mean given by

and adjusted variance matrix given by

where Iip=(Iif,Iim). The corresponding adjustment in is made. The parameter θ is estimated by its MLE under (3), along with the restriction

## Semiparametric model

For censored data, a commonly used semi parametric regression model is Cox’s proportional hazards model [20,21]. In the univariate case, let y(3)<y(4)<…<y(n) be the ordered observations of y1,…,yn (assume no ties for simplicity), x(i) and d(i) be the associated quantities, for y(i), of the xi’s and di’s. Let R(i) be the i-th risk set, the set of all individuals who are still under study at the ‘time’ just prior to y(i), U be the set of all uncensored individuals, and

be the hazard function. The proportional hazards model has a form of λ(y/x,θ)=h(β'x)λ0(y)-0.2cm for some known positive function h(⋅), and unspecified baseline hazard rate λ0(⋅), which implies that the distribution belongs to the Lehmann family [4] 1-F=(1-F0) for some F0(⋅) and γ>0. Under these assumptions, the conditional likelihood (partial likelihood [20,21]; marginal rank likelihood, [4]) is

where the estimate of θ is the MLE under Lc(y|θ). The optimality property of is studied extensively. In the case of multivariate observations, various extensions of this method have focused on each marginal distribution and Markov chain Monte Carlo on the margins [22]. Proposed a multivariate extension of the proportional hazards model, or frailty model, which is equivalent to an exponential specification of the joint survival function [23]. Proposed a class of multivariate failure time distributions, including a multivariate version of Cox’s proportional hazards model, in which the within family dependence is modeled by a common latent variable with a known parametric distribution given that all the family members are independent. Then the joint distribution is obtained by taking expectation of the conditional one. All these frailty models assume that there is a shared common dependent latent variable. This assumption basically requires that the distribution be interchangeable among the involved individuals. This is reasonable for some familial data but not generally true. Other existing multivariate proportional hazards models [24-26] are similar in nature. Here we model the within family dependence in a manifest way to be desirable for our genetic analysis. We adopt a successive conditional version of the proportional hazards model where we assume a special semi parametric form of the survival function in order to evaluate the conditioning in closed form easily. More specifically, in our multivariate proportional hazards model, we assume h(⋅) and λ0(⋅) are functions of d-variates each. Let yi,(3),…,yi,(ni) be the ordered observations on the i-th variable (i=1,…,d), define xi,(j),di,(j),Ri,(j), and Ui accordingly. Note there are structures in h(⋅) through the dependent effects among the covariates. Recall β'x is a d-vector, let

where Ω is the within individual covariance matrix. Then h(⋅) behaves as a d-variate normal density, and its marginal and conditional versions are well defined and in closed forms, although it is not a proper density function. We need the successive ‘conditioning’ form of h(⋅) to apply the proportional hazards method. Specifically, let wij be the j-th diagonal element of Ω where Ωj be the upperleft j-dimensional sub-matrix of it, aj be the first j elements in the j-th column; [β'x]j be the first j components of β'x,hj+1|j(⋅), be the conditional version of covariates [β'x]j+1 given [β'x]j. Then hj+1|j(β'x) is a univariate normal kernel with mean and variance, and

Thus, without mixing over gene, for singleton multivariate observations, the joint conditional likelihood is

Now for the case of nuclear family, inspired by (4), we assume h(⋅) has the form

Treat h(⋅) as a ‘density’. Recall μi=(μifimi1,…,μibi). The conditioning i]j+1|[μi]j can be applied component-wise, i.e.

i]j+1|[μi]j=([μif]j+1|[μif]j, [μim]j+1|[μim]j, [μi1]j+1|[μi1]j,…, [μibi]j+1|[μibi]j)

Now we have

In (7), [μ(gj,Iir)]1, means the first component of μ(gj,Iir) in Iir, and |Iir| denote its cardinality (r=f,m,1,…,bi). Now, the conditional likelihood is

where hj+1|j([μ(i)]j is given by (7). The MLE of θ is obtained under (8).

## Nonparametric model

For univariate censored data, [27,28] considered a class of estimators, including the weighted least squares estimators, for censored data. Here the weights are determined by the ordered statistics of the observations and the associated censoring indicators, and are derived from the empirical survival function, i.e., the Kaplan- Meier product limit estimator [29-32]. Formulated the multivariate Kaplan-Meier estimator. Using the product integral, the mathematical expressions are quite involved. So instead of choosing the weights according to the multivariate Kaplan-Meier estimator, we use the nonparametric locally weighted least squares method, also called locally linear regression smoothers [33,34]. Let Y and X be the d and J-dimensional random vectors corresponding to the full observation and the covariates for an individual. Let μ(x)=E(Y|X=x) denote the regression function. In the univariate observation case, the locally linear estimator of μ(x) is first to find and to minimize

Where K(⋅) is a kernel function, hn is the bandwidth, and . In our case, keep the notations in section 1. We choose the kernel to be the J-dimensional standard normal density f(⋅), and (⋅) to be its distribution function. To simplify the expression of the likelihood, let

and similarly for imr , , , , , and . To estimate , inspired from the univariate locally linear estimator and (4), we first find to minimize

Where Ω is the within individual variance matrix, ( and are the adjusted quantities as those in (4). And I is the sub-matrix of Ω with rows and columns corresponding to the non-zero elements of Iim.

We estimate

We estimate Ω by with

where nrs is the total number of individuals with non-missing (r,s)- th components, zrls are the rearrangement of the r-th component of for which the (r,s)-th components are non-missing.

Let be the minimize of (9), where the full depends on the genotype r and the point value x. It has the intercept term (recall (3)), and is approximated by setting Direct computation of in (7) is not easy, instead we use an iterative procedure as in the following steps.

Select starting values π(0) for π. With this π(0), compute Ω(0), and T(0)(r)s. Let η=(μ0,α,β) be the full representation of the regression parameters, Xir(r=f,m,1,…,ni) be the corresponding design matrix for the r-th individual in the i-th family. In iterations 1-m do the following

(i) Fix π(i) , Ω(i) and T(i)(r) s minimize (7) with respect to η to get

where for l=f,m, and for l=1,…,ni. (ii) Fix (i), minimize (7) with respect to π, with the constraint to get

(r=1,…,k) (11)

and update Ω(i+1), and T(i+1(r) with (i+1).For some pre-specified ∈>0, when the relative errors

we stop the process at the last step m, and take

For arbitrary kernel and reasonably chosen band width hn, various asymptotic results are established in case of standard nonmixture data. We conjecture that similar results will hold under some regularity conditions.

Lastly, the bandwidth determines the smoothness of the estimate. Interesting research that addresses the crucial problem of bandwidth selection can be found in [35]. There are considerable literatures for automatic methods that attempt to minimize a lack-of-fit criterion such as an integrated squared error. But most of the methods provide an optimal hn determined by some unknown quantities. For simplicity, let k=|Jij| be the dimension of the observed covariate of the j-th (j=f,m,1,…,ni) individual in the i-th family, for the corresponding kernel, we choose hn=Cn-1/(k+1), for some constant C>0, and C can be selected through numerical trial.

## Variance components model

As an alternative to the mixture models considered above, the Variance Components (VC) model [36,37] has received much attention recently due to very efficient in computation as well as relatively robustness to model misspecification [38-46].

Let yi be the trait vector of the i-th individual in the family, in case without censoring and missing records, the commonly used VC model describing the trait value is

yi=μ+gi+Gi+ηxi+ei

Where μ is the overall mean, gi is the unobserved random vector of major gene effects at the trait locus with alleles A and B, Gi is the unobserved polygenic effects vector, the ηj’s are effects associated with the covariates xij’s, and ei is the residual random error vector. The usual assumption is that gi,Gi and ei are uncorrelated and E(gi)=E(Gi)=E(ei)=0. When missing records are present, the model is modified as

In this model, the parameters of interests are specified in the family variance matrix, thus computation can be carried out efficiently without the multiple mixing. Let ykk and Ωk be the observation, its mean and variance matrix of the k-th family. We can define Ik, dk and Jk accordingly. The commonly used model for quantitative traits is the multivariate normal distribution, thus the total likelihood is

Here f is the distribution function of the normal distribution with mean 0 and variance Ω.

The key lies in the specification of the variance matrices Ωks, which we illustrate in the following settings.

In the simplest case of Hardy-Weinberg equilibrium among locus alleles without linkage to marker, and without censoring and missing records, the covariance matrix between individuals i and j of a given family can be found, for example, in [38]. Modified to our case, it is

where is the additive genetic variance matrix due to the locus, is the dominant genetic variance matrix, is the kinship coefficient between individuals i and j [47], and Δ7ij8ij9ij, etc. are the condensed kinship coefficient of Jacquard [48], between individuals i and j.

In the more general Hardy-Weinberg disequilibrium case, let f be the within population inbreeding coefficient f at the trait locus [49- 51]. Introduced CV model in this case, which modified in our case is

where γl(f)s are matrices determined by 2 , and f etc., see there for details.

In the case of linkage to marker with both Hardy-Weinberg and linkage equilibrium, the covariance in our case can be specified based on that of, for example [40], in the same way as above. In the case of linkage to marker with either one or both Hardy-Weinberg and linkage disequilibrium, the covariance in our case can be specified based on that of [51], in the same way.

## Competing risks

Now suppose that the response yi is the failure time and only the failure for one of the d diseases is observed for each individual. For the i-th family, the data have the form (yi,di,ji,xi), where yi=(yif,yim,yi1,… ,yibi), similarly for di,ji, and xi, where ji is the observed disease type indicator. For example if the observed disease for the father is type 2, then jif=2. Given the data (yi,di,ji,xi)s, we like to investigate the objective of interests for each of the d disease. This problem is that of the competing risks. Note here the response for each individual is one-dimensional, and hence the corresponding quantities have simple notations. We are interested in the genetic regression analysis for the competing risks. We use a variant of the proportional hazards model. The mean of the j-th type, r-th member of the i-th family is specified as

For a reasonably chosen function h(⋅), we specify

More convenient below is to use the notations

and

Let yj1<…<yjkj be the kj failures of type j(j=1,…,d), R(yji),be the risk set at yji, the partial likelihood is

## Asymptotic heuristic

For IID data, various asymptotic results can be obtained. The results from the score function, the likelihood ratio statistic, and the MLE are equivalent. These results can be used to establish confidence intervals or hypothesis testing, etc. for θ. Here we are more interested in using the MLE. For general dependence model, usually the treatment is non-standard. But for our model, since the log-likelihood is in the form of several additive pieces, standard method can be used to derive the asymptotic distribution of the MLE. Let zi=(yi,di)(i=1,…,n). For the IID data, it is well known that under mild regularity conditions, the MLE nis strongly consistent and asymptotically distributed normal with mean at the true parameter value θ0, and variance matrix given by the inverse of the Fisher information. Here the observations are unbalanced, the asymptotic variance is the Fisher information times a weight matrix. To derive it, we need some notations, and mainly concentrate on model (4).

Let Nip be the total number of parents with the i-th measurement non-missing (i=1,…,d), Niss be those for the siblings, N be the total number of individuals in the study, γNir=Nir/N(r=p,s). Assume limN→8γNir=γir>0 exists (r=p,s;i=1,…,d). Let Yp and Yj be general random vectors associated with a parent and sib respectively, and Δp and Δj be the corresponding random vectors.

For model (4), let

here we use the notation Δp v(⋅) to represent the marginal version of v(⋅) corresponding to the non-zero components of Δp. The Fisher information matrix is

The above expectation is more involved than it looks, since that involves summations of all possible combinations of non-zero elements of it with respect to Δr, and also the unknown distribution of it. Instead, an empirical version of it has a known form

where L(y\θ) is given by (4). At the true data generating parameter θ0, IN is strongly consistent for I(⋅). To obtain the weight matrix, we need to specify the parameter order in θ. We arrange the first k-1 entry to be π1,…,πk-1, next we arrange all the regression parameters for the first response variable,..., all the regression parameters for the last response variable, then all the independent parameters in the variance matrix S and covariance matrix &Omega; in the similar order. It is clear that, in the weight matrix W, for (i,j) corresponding to the first k-1 components in θ, the weight should be γp1p+…+γdp; for (i,j) corresponding to the r-th and the l-th regression parameters, the weight is ; for (i,j) corresponding to the (a,b)-th and the (u,v)-th variance or covariance, the weight is [(γapas) (γbpbs) (γupus) (γvpvs)]1/4. Let θ0 be the true unknown data generating parameter, d→ stands for convergence in distribution. Then, we have

where A⊗B stands for the Kronecker product of matrices A and B. Since I(⋅) involves unknown quantities, equivalently

where WN is W with the γs replaced by the γNs.

The above ideal applies to the other models in this paper, but the results will be more involved, and we only discuss them briefly.

For the proportional hazards model, even for the IID data case, the conditional likelihood looks much different from the full likelihood. Interestingly, the MLE from this model (under the assumption of correct model specification and some regularity conditions) has the same asymptotic distribution as that from the full likelihood [21,52]. For the proportional hazards model, it is noted [4] that the survival function can be written as

S(y\x,β)=-S0(y)h(β'x),

and f(y\x,β)=-dS(y\x,β)/dy. So to get the full log-likelihood (14), we need the estimate of S0(y) for the d-dimensional case [29,30]. Proposed the multi-dimensional generalization of the Kaplan-Meyer nonparametric estimator of S(y), similar technique can be used here for the construction of S0(y). Then (18) continues to hold in this case. Due to technical involvement, we will not pursue the details here.

For the least squares estimator, since the weight involves the kernel and hn, the treatment is different from those above, and in the case of full observation generally the asymptotic result is of the form

for some constant C and matrix Ω determined by the kernel and the true (unknown) data and censoring distributions [53,54]. In our case of partial observation, the above result holds with Ω replaced by Ω⊗W.

For the competing risks model, the structure is similar to that of the proportional hazards model. Here the response is one dimensional so that the survival function can be estimated by the Kaplan-Meyer estimator and the weight matrix W is the identity.

## Discussion

We have considered several statistical methods, parametric, semi parametric and nonparametric models, for the genetic regression analysis of familial multiple endpoints data, with possible missing records. Here we only considered the case of nuclear families and the parameters are independent of time. The cases of arbitrary pedigrees and/or the time dependent parameter can be treated similarly. The variance components method can also be applied to the proportional hazards model and in the analysis of competing risks. There are some marginal models for the multiple endpoints data, which work well in practice. But we think the joint model is more appropriate when the within responses structure is important in the analysis. Another commonly used method to deal with the missing data is the EM algorithm [55], which can be implemented into the models considered here. But for the multiple endpoints data, the proportion of missing part is usually large; the EM algorithm may not be efficient. When the missing pattern is non-ignorable, more complicated approaches need to be considered to reduce potential biases. Hypothesis testing for parameters of interests can be conducted using the likelihood ratio statistics based on the parametric models. We only derived the basic forms of these models; more features can be implemented to them in particular applications.

## References

Citation: Yuan A and Ahn J. Some Genetic Regression Models for Multiple Quantitative End Points Data. Austin Biom and Biostat. 2016; 3(1): 1029. ISSN: 2378-9840