Ao Yuan; Jaeil Ahn

Research Article

Austin Biom and Biostat. 2016; 3(1): 1029.

Some Genetic Regression Models for Multiple Quantitative End Points Data

Ao Yuan* and Jaeil Ahn

Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, USA

*Corresponding author: Ao Yuan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington D.C. 20057, USA

Received: May 28, 2015; Accepted: January 28, 2016; Published: February 08, 2016

Abstract

Multiple endpoints data are common in practice. There are various statistical methods for the analysis of this type of data; however, genetic models for familial observations with multiple endpoints are relatively few, and the existing methods are basically variations of the Elston-Stewart algorithm. Here we consider several joint statistical models for such data with quantitative measurements, together with a new algorithm that is computationally more efficient than the existing method. The proposed method is detailed in some commonly used parametric, semiparametric and nonparametric settings for this type of data. For un-genotyped data, the commonly used models are the mixture and variance components models. We elaborate how these genetic models can be extended to multiple endpoints data with the proposed method.

Keywords: Censoring; Endpoints data; Familial structure; Genotype; Missing observation

Introduction

Endpoints data are observed responses from patients for some pre-specified clinical events of interest, such as death, loss of vision, occurrence of certain diseases, or other symptomatic events. In medical research, study participants are often followed for a long time, during which some participants may drop out early, so that random censorship may be present in the data. Such data have missing observations, which may be inhomogeneous across patients. For example, on one patient we may have observations on lung cancer and kidney disease, and on another we may have observations on lung cancer, diabetes and asthma. Analyses of such data largely fall into two categories: hypothesis testing (usually non-model based) and model inference. Here we concentrate on model inference for such data.

There is an extensive literature on the modeling of censored data; see [1-8], to mention just a few. Various statistical methods exist for the analysis of multiple endpoints data; see, for example, [9-13]. Wei and Glidden [14] provided an overview of some of the methods in this field.

Family genetic data differ from ordinary data in that they are collected in familial units, often with varying structures and sizes, and with or without genotyping. These features make the models distinct. The key to the modeling is the familial dependence structure and the implementation of the genetic mechanism. Existing methods are basically variations of the Elston-Stewart algorithm, which is a multi-level mixture model, and the computation is often challenging. Genetic models for multiple endpoints data are relatively limited. Here we consider some joint statistical models for such data with quantitative measurements, using a new algorithm based on a one-level mixture model, which reduces the computation considerably. The parametric method is used when one has some confidence in the model specification. The semiparametric method can be used when there is not enough information for a full parametric model specification. The nonparametric method is used for robustness, as it makes the fewest model assumptions. We elaborate our methods for the parametric, semiparametric and nonparametric cases. The methods we describe below are valid for arbitrary pedigrees; however, in this article we focus on the simpler case of the nuclear family for illustration.

In genetic analysis, the data may be fully genotyped, partially genotyped, or not genotyped. However, even if the data are genotyped, it is still of interest to know whether some other unknown gene(s) lie behind the response. There are reports that adding an unknown gene locus reduced the Akaike information criterion (AIC) of the likelihood (e.g., [15], p. 1091; [16]), which makes sense, as adding correct parameter(s) to the model will reduce its AIC, or that the segregation analysis pointed to other gene(s) deserving further investigation. So even for genotyped data, a segregation model is still of importance. It is also the general model, which includes the genotyped data case. In the following we derive some commonly used regressive models for this case, including the parametric model, the semiparametric proportional hazards model, the nonparametric least squares model, the variance components model and the competing risks model. Hypothesis testing on parameters of interest can also be conducted using likelihood ratio statistics based on the parametric models. Our aim here is to present several new parametric, semiparametric and nonparametric models for such familial data, and thus we focus on deriving the basic forms of these models. Implementations of these models and applications to real data will follow in our future work.

Methods

Suppose there are d responses observed for some clinical events of interest, along with r covariates, for each member of a family. We concentrate on the nuclear family structure for simplicity. In practice we only observe a subset of the responses and covariates for each patient in a family. Let y_i=(y_if,y_im,y_is) (i=1,…,n) be the vector of responses for the i-th nuclear family, with corresponding covariates x_i=(x_if,x_im,x_is) (i=1,…,n). Here y_if is a d_if(≤d) dimensional vector of responses of the father in the i-th family, which belongs to a d_if dimensional subspace of the d-dimensional space, with a response non-missing indicator vector I_if and a covariate non-missing indicator vector J_if. Similarly, y_im denotes the vector of responses of the mother in the i-th family, and y_is=(y_i1,…,y_ib_i) is that of the b_i offspring, where each y_ij has the same data structure as y_if. For example, if there are three responses to be observed and we have the first and third on the father, then I_if=(1,0,1), and d_if=|I_if|=2 is its dimension or cardinality. If there are a total of five covariates in the design and we only have the first, second and fourth covariates for the father, then J_if=(1,1,0,1,0), and r_if=|J_if|=3 is its dimension. We assume random censorship. Let δ_if=(δ_if1,…,δ_ifd_if) be the censoring indicator of y_if, i.e. δ_ifj=1 if y_ifj is uncensored, and δ_ifj=0 otherwise. Similar notation is used for the mother. For the offspring, y_is=(y_i1,…,y_ib_i) denotes the response vector, with y_ij being the d_ij dimensional observation for the j-th sib, with response configuration I_ij and covariate configuration J_ij (j=1,…,b_i). Let d_i=(d_if,d_im,d_is), I_i=(I_if,I_im,I_is), J_i=(J_if,J_im,J_is), δ_i=(δ_if,δ_im,δ_is) and C_i=(C_if,C_im,C_is). The complete data information consists of Z_i=(y_i,x_i,I_i,J_i,δ_i) (i=1,…,n). Let χ(⋅) be the indicator function, i.e. χ(g_ir=s)=1 if the genotype of the r-th individual in the i-th family is s and zero otherwise.
Let π_s be the population proportion of the s-th genotype, t(s|i,k) be the transmission probability of an offspring's genotype s given the parents' genotypes (i,k), and θ be the collection of all the parameters, including the α's and β's in the mean, the parameters in the within- and between-individual covariance matrices, the genotype frequencies π_s, and the transmission probabilities t(s|i,k). With unobserved genotypes, the computation is a serious challenge because of the mixture nature of the model. Let f(y_r|θ), F(y_r|θ), and S(y_r|θ)=1-F(y_r|θ) be the density function, distribution function and survival function of Y_r, respectively. They may depend on the genotype g_r through the mean function specification. To simplify the model specification, we assume random mating, so that the father and mother can be viewed as independent in most cases. The case of non-random mating or dependence between the parents can be treated similarly with more involved notation.

Note that there is within family dependence but independence among different families. We assume the genotypes of each patient are unobserved; the case of observed genotypes is automatically covered and simpler. Now we describe the methods in some common settings below.

Parametric model

Let the genotypes at the locus of interest be coded as 1,…,k. We first consider the case of no missing records and no censoring. The regressive model assumes y_ir=μ(g_ir)+ε_ir, (i=1,…,n; r=f,m,1,…,b_i),

where μ(g_ir)=μ₀+αχ(g_ir)+βx_ir is the mean phenotypic value, α=(α₁,…,α_k)', χ(g_ir)=(χ(g_ir=1),…,χ(g_ir=k)), and μ₀ is the intercept vector. The residual error terms ε_ir are d-dimensional random vectors that are independent across i but dependent across r.

In the case of multivariate familial quantitative response data, under the commonly used Elston-Stewart [15] algorithm or its variants, the likelihood of the observation y_i for the i-th nuclear family is

$L_{i}(y_{i} | θ) = \sum_{k} π_{k} f(y_{if} | θ, g_{if} = k) \Big[ \sum_{j} π_{j} f(y_{im} | θ, g_{im} = j) K(θ, k, j) \prod_{l=1}^{b_{i}} \Big( \sum_{r} t(r | k, j) f(y_{il} | θ, g_{if} = k, g_{im} = j, g_{il} = r) \Big) \Big], \qquad (1)$

where typically the density is assumed multivariate normal with covariance matrix Σ, f(⋅|θ,k) is the density for the residual ε_i with genotype g_i in the mean vector specification, and f(y_l|θ,k,j,r) is the density for the residual ε_l with genotype g_l=r, with adjusted mean and variance given by

$μ(g_{il} = r) - Ω Σ^{-1} [(y_{if} - μ(g_{f} = k)) + (y_{m} - μ(g_{m} = j))] \quad \text{and} \quad Σ - Ω Σ^{-1} Ω,$

where Σ=Cov(Y_l,Y_l) and Ω=Cov(Y_l,Y_m); K(θ,k,j) is a quantity that depends on the parents' genotypes and the mean [18]. It is well known that when the number of genotypes is relatively large, this model is computationally inefficient. [19] proposed a computationally more efficient model. In light of [19], let G_f=G_m=(π₁,…,π_k); then the joint likelihood for the i-th family is written as

$L_{i}(y_{i} | θ) = \Big(\sum_{k} π_{k} f(y_{if} | θ, g_{if} = k)\Big) \Big(\sum_{j} π_{j} f(y_{im} | θ, g_{im} = j)\Big) K(θ) \prod_{l=1}^{b_{i}} \sum_{r} T(r) f(y_{il} | G, θ, g_{il} = r), \qquad (2)$

where $K (θ)= \sum_{i =1}^{k} \sum_{j =1}^{k} K (θ, i, j)$

$T (r)= T (r | G_{f}, G_{m})= \sum_{i =1}^{k} \sum_{s =1}^{k} t (r | i, s) P (g_{f} = i) P (g_{m} = s)= \sum_{i =1}^{k} \sum_{s =1}^{k} t (r | i, s) π_{i} π_{s},$

and, in the mean specification, f(y_l|G,θ,r) is the density of the residual ε_l with genotype g_l=r, with adjusted mean and variance given by

$μ(g_{il} = r) - Ω'_{p} Σ_{p}^{-1} (y_{p} - μ_{p}) \quad \text{and} \quad Σ - Ω'_{p} Σ_{p}^{-1} Ω_{p},$

where y_p=(y_f,y_m), μ_p=(μ_f,μ_m), $μ_{f} = \sum_{i=1}^{k} π_{i} μ_{f}(g_{f} = i)$, and similarly for μ_m, and

$Ω_{p} = \begin{pmatrix} Ω \\ Ω \end{pmatrix}, \quad Σ_{p} = \begin{pmatrix} Σ & Ω \\ Ω & Σ \end{pmatrix}.$
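As an illustration of the marginal transmission probability T(r), the following is a minimal sketch that evaluates the double sum over parental genotypes. It assumes, purely for illustration, a single biallelic locus with genotypes coded 0=AA, 1=AB, 2=BB, Mendelian segregation, and Hardy-Weinberg genotype proportions; the function names and the allele frequency `q` are hypothetical and not part of the original method.

```python
import numpy as np

def mendelian_transmission():
    """Return t[r, i, s] = P(offspring genotype r | father i, mother s).

    Assumes (for illustration only) one biallelic locus with genotypes
    coded 0 = AA, 1 = AB, 2 = BB and Mendelian segregation.
    """
    p_a = np.array([1.0, 0.5, 0.0])  # P(parent transmits allele A) by genotype
    t = np.zeros((3, 3, 3))
    for i in range(3):
        for s in range(3):
            t[0, i, s] = p_a[i] * p_a[s]
            t[2, i, s] = (1 - p_a[i]) * (1 - p_a[s])
            t[1, i, s] = 1.0 - t[0, i, s] - t[2, i, s]
    return t

def marginal_transmission(t, pi):
    """T(r) = sum_i sum_s t(r | i, s) * pi_i * pi_s."""
    return np.einsum("ris,i,s->r", t, pi, pi)

q = 0.3  # hypothetical frequency of allele A
pi = np.array([q**2, 2 * q * (1 - q), (1 - q) ** 2])  # Hardy-Weinberg proportions
T = marginal_transmission(mendelian_transmission(), pi)
```

Under these assumptions T(r) is itself a probability vector, and with Hardy-Weinberg proportions in the parents it reproduces the same proportions in the offspring.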

In comparison, model (1) has three layers of mixing (summation), corresponding to b_i k³ function evaluations, a number that grows cubically with the number of genotypes k. Model (2), on the other hand, has only one layer of mixing in each of its three factors, or (b_i+2)k function evaluations, which grows only linearly in the number of genotypes. The reduction in computation will be even more significant in the multiple loci case.
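To make the comparison concrete, the short sketch below tabulates the two evaluation counts quoted above, b_i k³ for model (1) and (b_i+2)k for model (2), for a few hypothetical family sizes and numbers of genotypes. The function names are ours and the numbers are only illustrative.

```python
def evaluations_model_1(b_i, k):
    """Function evaluations for the three-layer mixture (1): b_i * k^3."""
    return b_i * k**3

def evaluations_model_2(b_i, k):
    """Function evaluations for the one-layer mixture (2): (b_i + 2) * k."""
    return (b_i + 2) * k

# hypothetical setting: a family with 3 offspring and k = 3, 9, 27 genotypes
for k in (3, 9, 27):
    print(k, evaluations_model_1(3, k), evaluations_model_2(3, k))
```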

Here we extend this model to the case of censoring and partial observation. In this case, the mean is modeled as

$μ(g_{ir}) = I_{ir} ⊙ (μ_{0} + α χ(g_{ir}) + β (J_{ir} ⊙ x_{ir})), \qquad (3)$

where the operation $I_{ir} ⊙ μ_{0}$ means the projection of μ₀ onto the subspace corresponding to the nonzero elements of I_ir, and similarly for $I_{ir} ⊙ α χ(g_{ir})$ and $J_{ir} ⊙ x_{ir}$. The corresponding error is now $I_{ir} ⊙ ε_{ir}$.

Recall that in the case of a 1-dimensional observation without genetic implementation, the likelihood for an observation y_i with censoring indicator δ_i is

$L_{i} (y_{i} | θ)= f {(y_{i} | θ)}^{δ_{i}} S {(y_{i} | θ)}^{1 - δ_{i}} .$
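A minimal sketch of this one-dimensional censored contribution is given below. It assumes a normal density for illustration (the text only requires some parametric f and S), and the function names are hypothetical.

```python
import math

def normal_pdf(y, mean=0.0, sd=1.0):
    """Density f(y | theta) of a normal distribution (illustrative choice)."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_sf(y, mean=0.0, sd=1.0):
    """Survival function S(y | theta) = 1 - F(y | theta)."""
    return 0.5 * math.erfc((y - mean) / (sd * math.sqrt(2.0)))

def censored_contribution(y, delta, mean=0.0, sd=1.0):
    """f(y)^delta * S(y)^(1 - delta): density if uncensored, survival if censored."""
    return normal_pdf(y, mean, sd) ** delta * normal_sf(y, mean, sd) ** (1 - delta)
```

With delta=1 the contribution is the density at y, and with delta=0 it is the probability of surviving beyond y.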

To extend this to our situation, for any dimension indicator I_i, and any d-variable function v(⋅), let $I_{i} ⊙ v (\cdot)$ be the marginal version of v(⋅) with respect to the non-zero entries of I_i and $δ_{i} I_{i} ⊙ v (\cdot)= δ_{i} ⊙ (I_{i} ⊙ v (\cdot)).$ Let 1-δ_i be the indicator with the same length as δ_i but with 0 and 1 reversed. The full likelihood is

$L (z | θ)= \prod_{i =1}^{n} (\sum_{j =1}^{k} π_{j} δ_{i f} I_{i f} ⊙ f (y_{i f} | θ, j))(\sum_{j =1}^{k} π_{j} (1 - δ_{i f}) I_{i f} ⊙ S (y_{i f} | θ, j))$

$\times (\sum_{j =1}^{k} π_{j} δ_{i m} I_{i m} ⊙ f (y_{i m} | θ, j))(\sum_{j =1}^{k} π_{j} (1 - δ_{i m}) I_{i m} ⊙ S (y_{i m} | θ, j))$

$\times \prod_{j =1}^{b_{i}} (\sum_{r =1}^{k} T (r) δ_{i j} I_{i j} ⊙ f (y_{j} | G, θ, r))(\sum_{r =1}^{k} T (r)(1 - δ_{i j}) I_{i j} ⊙ S (y_{j} | G, θ, r)).(4)$

Here extra caution should be taken, since the observation vectors from different individuals may vary in dimension and subspace. For a d-dimensional vector v, let $δ_{ij} I_{ij} ⊙ v$ be its margin with respect to the non-zero entries of δ_ij I_ij; and for a d×d matrix A, let $δ_{ij} I_{ij} ⊙ A ⊙ I_{il} δ_{il}$ denote the submatrix formed by the rows corresponding to the non-zero entries of δ_ij I_ij and the columns corresponding to the non-zero entries of δ_il I_il. In particular, $δ_{ij} I_{ij} ⊙ f(y_{j} | G, θ, r)$ has adjusted mean given by

$δ_{i j} I_{i j} ⊙ μ (g_{i j} = r) - (δ_{i j} I_{i j} ⊙ {Ω^{'}}_{p} ⊙ I_{i p} δ_{i p})(δ_{i p} I_{i p} ⊙ Σ_{p}^{- 1} ⊙ I_{i p} δ_{i p})(δ_{i p} I_{i p} ⊙ (y_{p} - μ_{p})),$

and adjusted variance matrix given by

$δ_{i j} I_{i j} ⊙ Σ ⊙ I_{i j} δ_{i j} - (δ_{i j} I_{i j} ⊙ {Ω^{'}}_{p} ⊙ I_{i p} δ_{i p})(δ_{i p} I_{i p} ⊙ Σ_{p}^{- 1} ⊙ I_{i p} δ_{i p})(δ_{i p} I_{i p} ⊙ Ω_{p} ⊙ I_{i j} δ_{i j}),$

where I_ip=(I_if,I_im). The corresponding adjustment in $(1 - δ_{ij}) I_{ij} ⊙ S(y_{j} | G, θ, r)$ is made. The parameter θ is estimated by its MLE $\hat{θ}$ under (4), subject to the restriction $\sum_{j=1}^{k} π_{j} = 1$.
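The sub-vector and sub-matrix selections denoted by ⊙ can be sketched with boolean masks. The example below is only an illustration under our own assumptions: the helper names are hypothetical, the numerical values of Σ and Ω are made up, and we read the middle factor of the adjusted variance as the inverse of the observed block of Σ_p (the usual conditioning step), which is one possible reading of the notation.

```python
import numpy as np

def sub_vector(mask, v):
    """mask ⊙ v: keep the components of v where mask is non-zero."""
    return np.asarray(v)[np.asarray(mask, dtype=bool)]

def sub_matrix(row_mask, a, col_mask):
    """row_mask ⊙ A ⊙ col_mask: rows and columns where the masks are non-zero."""
    rows, cols = np.flatnonzero(row_mask), np.flatnonzero(col_mask)
    return np.asarray(a)[np.ix_(rows, cols)]

def adjusted_variance(sigma, omega_p, sigma_p, mask_j, mask_p):
    """Sigma - Omega_p' Sigma_p^{-1} Omega_p restricted to observed components.

    Assumption: the observed block of Sigma_p is inverted (standard conditioning).
    """
    s_jj = sub_matrix(mask_j, sigma, mask_j)
    o_pj = sub_matrix(mask_p, omega_p, mask_j)
    s_pp = sub_matrix(mask_p, sigma_p, mask_p)
    return s_jj - o_pj.T @ np.linalg.solve(s_pp, o_pj)

# hypothetical d = 2 example
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])   # within-individual covariance
omega = 0.25 * sigma                         # between-individual covariance
omega_p = np.vstack([omega, omega])          # stacked as in the definition of Omega_p
sigma_p = np.block([[sigma, omega], [omega, sigma]])
mask_j = [1, 0]                              # sib: first response observed
mask_p = [1, 0, 0, 1]                        # father: first, mother: second
v = adjusted_variance(sigma, omega_p, sigma_p, mask_j, mask_p)
```

Using `np.linalg.solve` on the observed block avoids forming an explicit inverse; the resulting adjusted variance is smaller than the marginal variance of the selected component.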

Semiparametric model

For censored data, a commonly used semiparametric regression model is Cox's proportional hazards model [20,21]. In the univariate case, let y_(1)<y_(2)<…<y_(n) be the ordered observations of y₁,…,y_n (assume no ties for simplicity), and let x_(i) and δ_(i) be the quantities among the x_i's and δ_i's associated with y_(i). Let R_(i) be the i-th risk set, the set of all individuals who are still under study at the 'time' just prior to y_(i), U be the set of all uncensored individuals, and

$λ (y | x, θ)= \frac{f (y | x, θ)}{1 - F (y | x, θ)}$

be the hazard function. The proportional hazards model has the form λ(y|x,θ)=h(β'x)λ₀(y) for some known positive function h(⋅) and unspecified baseline hazard rate λ₀(⋅), which implies that the distribution belongs to the Lehmann family [4], 1-F=(1-F₀)^γ for some F₀(⋅) and γ>0. Under these assumptions, the conditional likelihood (partial likelihood [20,21]; marginal rank likelihood [4]) is

$L_{c} (y | θ)= \prod_{i \in U} \frac{h (β^{'} x_{(i)})}{\sum_{j \in R_{(i)}} h (β^{'} x_{(j)})},$

where the estimate of θ is the MLE $\hat{θ}$ under L_c(y|θ). The optimality properties of $\hat{θ}$ have been studied extensively. In the case of multivariate observations, various extensions of this method have focused on each marginal distribution and on Markov chain Monte Carlo on the margins. [22] proposed a multivariate extension of the proportional hazards model, or frailty model, which is equivalent to an exponential specification of the joint survival function. [23] proposed a class of multivariate failure time distributions, including a multivariate version of Cox's proportional hazards model, in which the within-family dependence is modeled by a common latent variable with a known parametric distribution, given which all the family members are independent. The joint distribution is then obtained by taking the expectation of the conditional one. All these frailty models assume that there is a shared common latent variable. This assumption basically requires that the distribution be exchangeable among the individuals involved, which is reasonable for some familial data but not true in general. Other existing multivariate proportional hazards models [24-26] are similar in nature.

Here we model the within-family dependence explicitly, which is desirable for our genetic analysis. We adopt a successive conditional version of the proportional hazards model, in which we assume a special semiparametric form of the survival function so that the conditioning can be evaluated easily in closed form. More specifically, in our multivariate proportional hazards model, we assume h(⋅) and λ₀(⋅) are each functions of d variables. Let y_i,(1),…,y_i,(n_i) be the ordered observations on the i-th variable (i=1,…,d), and define x_i,(j), δ_i,(j), R_i,(j), and U_i accordingly. Note that there is structure in h(⋅) through the dependent effects among the covariates. Recall that β'x is a d-vector, and let

$h (β^{'} x)= e^{- \frac{1}{2} (β^{'} x)^{'} Ω β^{'} x},$

where Ω is the within-individual covariance matrix. Then h(⋅) behaves as a d-variate normal density, and its marginal and conditional versions are well defined and in closed form, although it is not a proper density function. We need the successive 'conditioning' form of h(⋅) to apply the proportional hazards method. Specifically, let ω_jj be the j-th diagonal element of Ω, Ω_j be its upper-left j-dimensional sub-matrix, a_j be the first j elements in the j-th column, [β'x]_j be the first j components of β'x, and h_j+1|j(⋅) be the conditional version of [β'x]_j+1 given [β'x]_j. Then h_j+1|j(β'x) is a univariate normal kernel with mean $- {a^{'}}_{j} Ω_{j}^{- 1} {[β^{'} x]}_{j}$ and variance $ω_{j j} - {a^{'}}_{j} Ω_{j}^{- 1} a_{j}$, and

$h (β^{'} x)= h_{1|0} ([β^{'} x]_{1}) \prod_{j =1}^{d - 1} h_{j + 1| j} ([β^{'} x]_{j + 1}).(5)$

Thus, without mixing over genotypes, for singleton multivariate observations, the joint conditional likelihood is

$L_{c} (z | θ)= \prod_{i \in U_{1}} \frac{h ([β^{'} x_{(i)}]_{1})}{\sum_{l \in R_{1,(i)}} h ([β^{'} x_{(l)}]_{1})} \prod_{j =1}^{d - 1} \prod_{i \in U_{j}} \frac{h_{j + 1| j} ([β^{'} x_{(i)}]_{j})}{\sum_{l \in R_{j,(i)}} h_{j + 1| j} ([β^{'} x_{(l)}]_{j})} .$
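The univariate building block of this likelihood, the partial likelihood L_c(y|θ) introduced at the start of this section, can be sketched as follows. The sketch assumes the common choice h(u)=exp(u), no ties, and a risk set consisting of all individuals whose observed time is at least the current failure time; the function name is hypothetical, and this is not the multivariate construction of the paper.

```python
import numpy as np

def log_partial_likelihood(beta, y, x, delta):
    """Log of prod_{i in U} h(beta'x_i) / sum_{j in R_i} h(beta'x_j).

    Assumes h(u) = exp(u) and no tied observation times.
    y: observed times, x: (n, p) covariates, delta: 1 = uncensored, 0 = censored.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    score = x @ np.asarray(beta, dtype=float)  # beta'x for every individual
    h = np.exp(score)
    total = 0.0
    for i in np.flatnonzero(delta == 1):       # loop over uncensored individuals
        at_risk = y >= y[i]                    # risk set just prior to y_i
        total += score[i] - np.log(h[at_risk].sum())
    return total
```

With β=0 every individual in a risk set is equally likely to fail, so three uncensored observations give a partial likelihood of 1/(3·2·1).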

Now for the case of nuclear family, inspired by (4), we assume h(⋅) has the form

$h (μ_{i})= \sum_{j =1}^{k} π_{j} h_{I_{i f}} (μ (g_{j}, I_{i f})) \sum_{j =1}^{k} π_{j} h_{I_{i m}} (μ (g_{j}, I_{i m})) \prod_{l =1}^{b_{i}} \sum_{j =1}^{k} T (j) h_{I_{i j}} (μ (g_{j}, I_{i j})),(6)$

Here h(⋅) is treated as a 'density'. Recall that μ_i=(μ_if,μ_im,μ_i1,…,μ_ib_i). The conditioning [μ_i]_j+1|[μ_i]_j can be applied component-wise, and we have

$h_{j+1|j}([μ_{i}]_{j+1}) = \Big(\sum_{j=1}^{k} π_{j} h_{I_{if},1|0}([μ(g_{j}, I_{if})]_{1}) \prod_{j=1}^{|I_{if}|-1} h_{I_{if},j+1|j}([μ(g_{j}, I_{if})]_{j})\Big)$

$\times \Big(\sum_{j=1}^{k} π_{j} h_{I_{im},1|0}([μ(g_{j}, I_{im})]_{1}) \prod_{j=1}^{|I_{im}|-1} h_{I_{im},j+1|j}([μ(g_{j}, I_{im})]_{j})\Big)$

$\times \prod_{l=1}^{b_{i}} \Big(\sum_{j=1}^{k} T(j) h_{I_{il},1|0}([μ(g_{j}, I_{il})]_{1}) \prod_{j=1}^{|I_{il}|-1} h_{I_{il},j+1|j}([μ(g_{j}, I_{il})]_{j})\Big). \qquad (7)$

In (7), [μ(g_j,I_ir)]_1 means the first component of μ(g_j,I_ir) in I_ir, and |I_ir| denotes its cardinality (r=f,m,1,…,b_i). Now, the conditional likelihood is

$L_{c}(z | θ) = \prod_{i \in U_{1}} \frac{h([μ_{(i)}]_{1})}{\sum_{l \in R_{1,(i)}} h([μ_{(l)}]_{1})} \prod_{j=1}^{d-1} \prod_{i \in U_{j}} \frac{h_{j+1|j}([μ_{(i)}]_{j})}{\sum_{l \in R_{j,(i)}} h_{j+1|j}([μ_{(l)}]_{j})}, \qquad (8)$

where h_j+1|j([μ_(i)]_j) is given by (7). The MLE $\hat{θ}$ of θ is obtained under (8).

Nonparametric model

For univariate censored data, [27,28] considered a class of estimators, including the weighted least squares estimators. Here the weights are determined by the order statistics of the observations and the associated censoring indicators, and are derived from the empirical survival function, i.e., the Kaplan-Meier product limit estimator [29-32]. A multivariate Kaplan-Meier estimator has also been formulated; it uses the product integral, and its mathematical expressions are quite involved. So instead of choosing the weights according to the multivariate Kaplan-Meier estimator, we use the nonparametric locally weighted least squares method, also called the locally linear regression smoother [33,34]. Let Y and X be the d- and J-dimensional random vectors corresponding to the full observation and the covariates for an individual. Let μ(x)=E(Y|X=x) denote the regression function. In the univariate case, the locally linear estimator $\hat{μ} (x)$ of μ(x) is obtained by first finding $\hat{a}$ and $\hat{b}$ to minimize

$\sum_{i =1}^{n} {(y_{i} - a - b (x - x_{i}))}^{2} K (\frac{x - x_{i}}{h_{n}}),$

where K(⋅) is a kernel function, h_n is the bandwidth, and $\hat{μ} (x)= \hat{a}$. In our case, we keep the notation introduced above. We choose the kernel to be the J-dimensional standard normal density φ(⋅), and let Φ(⋅) be its distribution function. To simplify the expression of the likelihood, let

${\tilde{y}}_{i f r} = y_{i f} - I_{i f} ⊙ μ (r), {\tilde{x}}_{i f} = J_{i f} ⊙ (x - x_{i f}), {\overset{⌣}{x}}_{i f} =(1 - J_{i f}) ⊙ (x - x_{i f}),$

and similarly for ${\tilde{y}}_{i m r}$ , ${\tilde{y}}_{i j r}$ , ${\tilde{x}}_{i m}$ , ${\overset{⌣}{x}}_{i m}$ , ${\tilde{x}}_{i j}$ , and ${\overset{⌣}{x}}_{i j}$ . To estimate μ(x), inspired by the univariate locally linear estimator and (4), we first find $({\hat{μ}}_{0} (x), \hat{α}, \hat{β}, \hat{π})$ to minimize

$\sum_{i=1}^{n} \Big[ \sum_{r=1}^{k} π_{r}^{2} \tilde{y}'_{ifr} (I_{if} ⊙ Ω ⊙ I_{if}) {\tilde{y}}_{ifr} \, J_{if} ⊙ φ(\frac{{\tilde{x}}_{if}}{h_{n}}) (1 - (1 - J_{if}) ⊙ Φ(\frac{{\overset{⌣}{x}}_{if}}{h_{n}}))$

$+ \sum_{r=1}^{k} π_{r}^{2} \tilde{y}'_{imr} (I_{im} ⊙ Ω ⊙ I_{im}) {\tilde{y}}_{imr} \, J_{im} ⊙ φ(\frac{{\tilde{x}}_{im}}{h_{n}}) (1 - (1 - J_{im}) ⊙ Φ(\frac{{\overset{⌣}{x}}_{im}}{h_{n}}))$

$+ \sum_{j=1}^{b_{i}} \sum_{r=1}^{k} T^{2}(r) \tilde{y}'_{ijr} (I_{ij} ⊙ Ω ⊙ I_{ij}) {\tilde{y}}_{ijr} \, J_{ij} ⊙ φ(\frac{{\tilde{x}}_{ij}}{h_{n}} | {\tilde{x}}_{p}) (1 - (1 - J_{ij}) ⊙ Φ(\frac{{\overset{⌣}{x}}_{ij}}{h_{n}} | {\overset{⌣}{x}}_{p})) \Big], \qquad (9)$

where Ω is the within-individual variance matrix, and $φ (\cdot | {\tilde{x}}_{p})$ and $Φ (\cdot | {\overset{⌣}{x}}_{p})$ are the adjusted quantities, as in (4). Here $I_{i r} ⊙ Ω ⊙ I_{i r}$ is the sub-matrix of Ω with rows and columns corresponding to the non-zero elements of I_ir.
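The univariate locally linear estimator on which this criterion is modeled can be sketched as a kernel-weighted least squares fit. The example below is a minimal illustration only: it assumes a Gaussian kernel (up to a constant), fully observed uncensored data, and a hypothetical function name; it does not implement the family-level criterion (9).

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Locally linear estimate of mu(x0) = E(Y | X = x0).

    Minimizes sum_i (y_i - a - b (x0 - x_i))^2 K((x0 - x_i) / h) over (a, b)
    and returns a_hat. Assumes a Gaussian kernel K (an illustrative choice).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)            # kernel weights
    design = np.column_stack([np.ones_like(x), x0 - x])
    sw = np.sqrt(w)                                   # weighted LS via row scaling
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[0]

# hypothetical data: an exactly linear trend is reproduced by the local fit
x_obs = np.linspace(0.0, 1.0, 21)
y_obs = 2.0 + 3.0 * x_obs
estimate = local_linear(0.5, x_obs, y_obs, h=0.5)
```

Because the fit is linear within each kernel window, an exactly linear regression function is recovered without bias, which is a convenient sanity check for an implementation.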

We estimate Ω by $\hat{Ω} =({\hat{ω}}_{r s})$ with

${\hat{ω}}_{rs} = \frac{1}{n_{rs} - 1} \sum_{l=1}^{n_{rs}} (z_{rl} - {\bar{z}}_{r})(z_{sl} - {\bar{z}}_{s}), \quad r, s = 1,..., d,$

where n_rs is the total number of individuals with non-missing (r,s)-th components, and the z_rl's are the rearrangement of the r-th components of $y_{i t} - \sum_{j = 1}^{k} π_{j} I_{i t} ⊙ μ (j)$ for which the (r,s)-th components are non-missing.
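A minimal sketch of such a pairwise estimate is shown below. It assumes the residuals are stored in an array with `NaN` marking missing components, and that the means in each (r,s) entry are taken over the individuals for which both components are observed; the function name and this storage convention are our assumptions.

```python
import numpy as np

def pairwise_covariance(z):
    """Estimate omega_rs from the rows of z where components r and s are both observed.

    z: (n, d) array of residuals with NaN for missing components.
    Entries with fewer than two complete pairs are left as NaN.
    """
    z = np.asarray(z, dtype=float)
    d = z.shape[1]
    omega = np.full((d, d), np.nan)
    for r in range(d):
        for s in range(d):
            both = ~np.isnan(z[:, r]) & ~np.isnan(z[:, s])
            n_rs = int(both.sum())
            if n_rs > 1:
                zr, zs = z[both, r], z[both, s]
                omega[r, s] = ((zr - zr.mean()) * (zs - zs.mean())).sum() / (n_rs - 1)
    return omega
```

Note that a matrix assembled entry by entry in this way is symmetric but is not guaranteed to be positive semi-definite when the missingness patterns differ across components.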

Let $(\hat{μ} (r, x), \hat{π})$ be the minimizer of (9), where the full $\hat{μ} (r, x)$ depends on the genotype r and the point value x. It has the intercept term ${\hat{μ}}_{0} (x)$ (recall (3)), and $\hat{μ} (x)$ is approximated by setting $\hat{μ} (x)= {\hat{μ}}_{0} (x)$. Direct computation of $(\hat{μ} (r, x), \hat{π})$ in (9) is not easy; instead we use an iterative procedure, as in the following steps.

Select starting values π⁽⁰⁾ for π. With this π⁽⁰⁾, compute Ω⁽⁰⁾ and the T⁽⁰⁾(r)'s. Let η=(μ₀,α,β) be the full representation of the regression parameters, and X_ir (r=f,m,1,…,b_i) be the corresponding design matrix for the r-th individual in the i-th family. In iterations i=1,…,m, do the following.

(i) Fix π⁽ⁱ⁾, Ω⁽ⁱ⁾ and the T⁽ⁱ⁾(r)'s, and minimize (9) with respect to η to get

$η^{(i)} = \Big(\sum_{i=1}^{n} \sum_{r=1}^{k} \sum_{l} ϖ_{lr}^{(i)2} X_{il} {X^{'}}_{il} \, J_{il} ⊙ φ(\frac{{\tilde{x}}_{il}}{h_{n}}) (1 - (1 - J_{il}) ⊙ Φ(\frac{{\overset{⌣}{x}}_{il}}{h_{n}}))\Big)^{-1}$

$\times \sum_{i=1}^{n} \sum_{r=1}^{k} \sum_{l} ϖ_{lr}^{(i)2} y_{il} {X^{'}}_{il} \, J_{il} ⊙ φ(\frac{{\tilde{x}}_{il}}{h_{n}}) (1 - (1 - J_{il}) ⊙ Φ(\frac{{\overset{⌣}{x}}_{il}}{h_{n}})), \qquad (10)$

where $ϖ_{l r}^{(i)} = π_{r}^{(i)}$ for l=f,m, and $ϖ_{l r}^{(i)} = T^{(i)} (r)$ for l=1,…,b_i.

(ii) Fix $η^{(i)}$, and minimize (9) with respect to π, under the constraint $\sum_{j =1}^{k} π_{j} =1$, to get

$π_{r}^{(i + 1)} = \frac{\sum_{i =1}^{n} \sum_{l = f, m} \tilde{y}'_{i l r} (I_{i l} ⊙ Ω ⊙ I_{i l}) {\tilde{y}}_{i l r} J_{i l} ⊙ φ (\frac{{\tilde{x}}_{i l}}{h_{n}})(1 - (1 - J_{i l}) ⊙ Φ (\frac{{\overset{⌣}{x}}_{i l}}{h_{n}}))}{\sum_{i =1}^{n} \sum_{l = f, m} \sum_{r =1}^{k} \tilde{y}'_{i l r} (I_{i l} ⊙ Ω ⊙ I_{i l}) {\tilde{y}}_{i l r} J_{i l} ⊙ φ (\frac{{\tilde{x}}_{i l}}{h_{n}})(1 - (1 - J_{i l}) ⊙ Φ (\frac{{\overset{⌣}{x}}_{i l}}{h_{n}}))}, \quad (r = 1,…,k), \qquad (11)$

and update Ω⁽ⁱ⁺¹⁾ and T⁽ⁱ⁺¹⁾(r) with π⁽ⁱ⁺¹⁾. For some pre-specified ε>0, when the relative error satisfies

$\frac{|(μ^{(m)} (r, x), π^{(m)}) - (μ^{(m - 1)} (r, x), π^{(m - 1)})|}{|(μ^{(m - 1)} (r, x), π^{(m - 1)})|} \leq ε$

we stop the process at the last step m, and take

$(\hat{μ} (r, x), \hat{π})=(μ^{(m)} (r, x), π^{(m)})$
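The alternating structure of steps (i) and (ii), together with the relative-error stopping rule, can be sketched generically as below. This is only a skeleton under our own assumptions: `update_eta` and `update_pi` are hypothetical callables standing in for (10) and (11), the stopping rule is applied to the stacked vector (η, π) rather than to (μ(r,x), π), and the toy updates at the end are contractions chosen purely to exercise the loop.

```python
import numpy as np

def alternate(update_eta, update_pi, eta0, pi0, eps=1e-8, max_iter=500):
    """Alternate the two updates until the relative change is at most eps.

    update_eta(pi) plays the role of step (i); update_pi(eta) of step (ii).
    Returns the final (eta, pi) and the number of iterations used.
    """
    eta = np.asarray(eta0, dtype=float)
    pi = np.asarray(pi0, dtype=float)
    for m in range(1, max_iter + 1):
        eta_new = np.asarray(update_eta(pi), dtype=float)
        pi_new = np.asarray(update_pi(eta_new), dtype=float)
        old = np.concatenate([eta.ravel(), pi.ravel()])
        new = np.concatenate([eta_new.ravel(), pi_new.ravel()])
        eta, pi = eta_new, pi_new
        if np.linalg.norm(new - old) <= eps * np.linalg.norm(old):
            break
    return eta, pi, m

# toy updates (not the paper's formulas): fixed point eta = 4/3, pi = 2/3
eta_hat, pi_hat, steps = alternate(
    lambda pi: 0.5 * pi + 1.0,
    lambda eta: 0.5 * eta,
    eta0=np.array([0.0]),
    pi0=np.array([0.5]),
)
```

In an actual implementation the two callables would also recompute Ω and T(r) from the current π, as described in the steps above.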

For an arbitrary kernel and a reasonably chosen bandwidth h_n, various asymptotic results have been established for standard non-mixture data. We conjecture that similar results will hold here under some regularity conditions.

Lastly, the bandwidth determines the smoothness of the estimate. Interesting research addressing the crucial problem of bandwidth selection can be found in [35]. There is a considerable literature on automatic methods that attempt to minimize a lack-of-fit criterion such as the integrated squared error, but most of these methods provide an optimal h_n that depends on some unknown quantities. For simplicity, let k=|J_ij| be the dimension of the observed covariates of the j-th (j=f,m,1,…,b_i) individual in the i-th family; for the corresponding kernel we choose $h_{n} = C n^{-1/(k+1)}$ for some constant C>0, where C can be selected through numerical trial.
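The bandwidth rule above is simple enough to state as a one-line function. The sketch below assumes the rule h_n = C n^(-1/(k+1)) as written; the function name and the default C=1 are hypothetical, and C would in practice be tuned by numerical trial.

```python
def bandwidth(n, k, c=1.0):
    """h_n = C * n^(-1 / (k + 1)) for sample size n and covariate dimension k."""
    return c * n ** (-1.0 / (k + 1))

# hypothetical values: n = 16 observations, k = 3 observed covariates
h = bandwidth(16, 3)
```

The bandwidth shrinks as n grows, and shrinks more slowly when more covariates are observed.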

Variance components model

As an alternative to the mixture models considered above, the Variance Components (VC) model [36,37] has received much attention recently, owing to its computational efficiency and its relative robustness to model misspecification [38-46].

Let y_i be the trait vector of the i-th individual in the family. In the case without censoring and missing records, the commonly used VC model describing the trait value is

y_i=μ+g_i+G_i+ηx_i+e_i

where μ is the overall mean, g_i is the unobserved random vector of major gene effects at the trait locus with alleles A and B, G_i is the unobserved polygenic effects vector, the η_j's are the effects associated with the covariates x_ij's, and e_i is the residual random error vector. The usual assumption is that g_i, G_i and e_i are uncorrelated and E(g_i)=E(G_i)=E(e_i)=0. When missing records are present, the model is modified as

$y_{i} = I_{i} ⊙ (μ + g_{i} + G_{i} + η (J_{i} ⊙ x_{i})) + I_{i} ⊙ e_{i}. \qquad (12)$

In this model, the parameters of interest are specified in the family variance matrix, so computation can be carried out efficiently without the multiple mixing. Let y_k, μ_k and Ω_k be the observation, its mean and its variance matrix for the k-th family. We can define I_k, δ_k and J_k accordingly. The commonly used model for quantitative traits is the multivariate normal distribution, thus the total likelihood is

$L(z | θ) = \prod_{k=1}^{K} δ_{k} I_{k} ⊙ φ(y_{k} - μ_{k} | Ω_{k}) \, (1 - δ_{k}) I_{k} ⊙ Φ(y_{k} - μ_{k} | Ω_{k}).$

Here φ(⋅|Ω) and Φ(⋅|Ω) are the density and distribution function of the normal distribution with mean 0 and variance Ω.

The key lies in the specification of the variance matrices Ω_k, which we illustrate in the following settings.

In the simplest case of Hardy-Weinberg equilibrium among locus alleles without linkage to marker, and without censoring and missing records, the covariance matrix between individuals i and j of a given family can be found, for example, in [38]. Modified to our case, it is

$Cov(Y_{ki}, Y_{kj}) = \begin{cases} I_{i} ⊙ (σ_{a}^{2} + σ_{d}^{2} + σ_{G}^{2} + σ_{e}^{2}) ⊙ I_{j}, & \text{if } i = j \\ I_{i} ⊙ (2 Φ_{ij} σ_{a}^{2} + Δ_{7ij} σ_{d}^{2} + 2 Φ_{ij} σ_{G}^{2}) ⊙ I_{j}, & \text{if } i \neq j \end{cases} \qquad (13)$

where $σ_{a}^{2}$ is the additive genetic variance matrix due to the locus, $σ_{d}^{2}$ is the dominance genetic variance matrix, $Φ_{i j} = Δ_{7 i j} /2 + Δ_{8 i j} /4$ is the kinship coefficient between individuals i and j [47], and Δ_7ij, Δ_8ij, Δ_9ij, etc. are the condensed identity coefficients of Jacquard [48] between individuals i and j.
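As a concrete illustration, the sketch below builds this covariance matrix for a non-inbred nuclear family with a single fully observed trait. It assumes the standard variance components form, with off-diagonal entries 2Φ_ij σ_a² + Δ_7ij σ_d² + 2Φ_ij σ_G², which is our reading of (13), and the usual coefficients for a nuclear family: Φ=0 between the parents, Φ=1/4 and Δ_7=0 for parent-offspring pairs, and Φ=1/4 and Δ_7=1/4 for full sibs. The function name and the numerical variance values are hypothetical.

```python
import numpy as np

def nuclear_family_covariance(n_sibs, var_a, var_d, var_g, var_e):
    """Trait covariance for (father, mother, sib_1, ..., sib_n), univariate trait.

    Assumes unrelated, non-inbred parents and no missing or censored records.
    """
    n = 2 + n_sibs
    phi = np.zeros((n, n))      # kinship coefficients (off-diagonal only)
    delta7 = np.zeros((n, n))   # P(pair shares both alleles identical by descent)
    for i in range(n):
        for j in range(n):
            if i == j or (i < 2 and j < 2):
                continue        # diagonal handled below; parents are unrelated
            phi[i, j] = 0.25    # parent-offspring and full-sib pairs
            if i >= 2 and j >= 2:
                delta7[i, j] = 0.25
    cov = 2.0 * phi * var_a + delta7 * var_d + 2.0 * phi * var_g
    cov[np.diag_indices(n)] = var_a + var_d + var_g + var_e
    return cov

# hypothetical variance components for a family with two sibs
cov = nuclear_family_covariance(2, var_a=1.0, var_d=0.4, var_g=0.6, var_e=1.0)
```

In the multivariate case with missing records, each scalar variance would be replaced by the corresponding matrix and the rows and columns selected by the indicators I_i and I_j.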

In the more general Hardy-Weinberg disequilibrium case, let f be the within-population inbreeding coefficient at the trait locus [49-51]. A VC model has been introduced for this case, which, modified to our case, is

$Cov(Y_{i}, Y_{j} | f) = \begin{cases} I_{i} ⊙ ((1 + \frac{f}{2}) σ_{a}^{2} + (1 - f) σ_{d}^{2} + f σ_{0}^{2} + σ_{G}^{2} + σ_{e}^{2}) ⊙ I_{j}, & \text{if } i = j \\ I_{i} ⊙ (Δ_{7ij} γ_{7}(f) + Δ_{8ij} γ_{8}(f) + 2 Φ_{ij} σ_{G}^{2}) ⊙ I_{j}, & \text{if } i \neq j \end{cases} \qquad (14)$

where the γ_l(f)'s are matrices determined by 2 $σ_{a}^{2}$ , $σ_{d}^{2}$ , f, etc.; see the references cited above for details.

In the case of linkage to marker with both Hardy-Weinberg and linkage equilibrium, the covariance in our case can be specified based on that of, for example [40], in the same way as above. In the case of linkage to marker with either one or both Hardy-Weinberg and linkage disequilibrium, the covariance in our case can be specified based on that of [51], in the same way.

Competing risks

Now suppose that the response y_i is the failure time and that only the failure from one of the d diseases is observed for each individual. For the i-th family, the data have the form (y_i,δ_i,j_i,x_i), where y_i=(y_if,y_im,y_i1,…,y_ib_i), and similarly for δ_i, j_i and x_i, where j_i is the observed disease type indicator. For example, if the observed disease for the father is type 2, then j_if=2. Given the data (y_i,δ_i,j_i,x_i), we would like to investigate the quantities of interest for each of the d diseases. This is the problem of competing risks. Note that here the response for each individual is one-dimensional, and hence the corresponding quantities have simple notation. We are interested in the genetic regression analysis for the competing risks, and use a variant of the proportional hazards model. The mean of the j-th type for the r-th member of the i-th family is specified as

$μ_{j}(g_{ir}) = I_{ir} ⊙ (μ_{0j} + α_{j} χ(g_{ir}) + β_{j} (J_{ir} ⊙ x_{ir})), \quad (r = f, m, 1,..., b_{i}).$

For a reasonably chosen function h(⋅), we specify

$h (μ_{i})=(\sum_{l =1}^{k} π_{l} h (μ_{j_{f}} (g_{l})))(\sum_{l =1}^{k} π_{l} h (μ_{j_{m}} (g_{l}))) \prod_{l =1}^{b_{i}} (\sum_{s =1}^{k} T (s) h (μ_{j_{l}} (g_{s}))).$

More convenient below is to use the notations

$h_{r} (μ_{i})= \sum_{l =1}^{k} π_{l} h (μ_{j_{r}} (g_{l})),(r = f, m)$

and

$h_{r} (μ_{i})= \sum_{s =1}^{k} T (s) h (μ_{j_{r}} (g_{s})| y_{p}),(r =1,..., b_{i}).$

Let y_j1<…<y_jk_j be the k_j failures of type j (j=1,…,d), and R(y_ji) be the risk set at y_ji. The partial likelihood is

$L (y | θ)= \prod_{j =1}^{d} \prod_{i =1}^{k_{j}} \frac{h_{r_{i}} (μ_{j})}{\sum_{l \in R (y_{j i})} h_{r_{l}} (μ_{j})} .(15)$

Asymptotic heuristic

For IID data, various asymptotic results can be obtained. The results from the score function, the likelihood ratio statistic, and the MLE are equivalent. These results can be used to establish confidence intervals or hypothesis testing, etc. for θ. Here we are more interested in using the MLE. For general dependence model, usually the treatment is non-standard. But for our model, since the log-likelihood is in the form of several additive pieces, standard method can be used to derive the asymptotic distribution of the MLE. Let zi=(yi,di)(i=1,…,n). For the IID data, it is well known that under mild regularity conditions, the MLE ${\hat{θ}}_{n}$ nis strongly consistent and asymptotically distributed normal with mean at the true parameter value θ0, and variance matrix given by the inverse of the Fisher information. Here the observations are unbalanced, the asymptotic variance is the Fisher information times a weight matrix. To derive it, we need some notations, and mainly concentrate on model (4).

Let N_ip be the total number of parents with the i-th measurement non-missing (i=1,…,d), N_is the corresponding number for the siblings, N the total number of individuals in the study, and γ_{N,ir}=N_ir/N (r=p,s). Assume that the limit lim_{N→∞} γ_{N,ir}=γ_ir>0 exists (r=p,s; i=1,…,d). Let Y_p and Y_j be general random vectors associated with a parent and a sib respectively, and let Δ_p and Δ_j be the corresponding random indicator vectors.
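As a small illustration of these sample fractions, the sketch below computes N_ir/N for parents and sibs from a missingness pattern. The function name `observed_fractions` and the input layout (one row per individual, one column per measurement, plus a parent flag) are assumptions made for this example only.

```python
import numpy as np

def observed_fractions(observed, is_parent):
    """Return the fractions N_ip/N and N_is/N for each measurement i.

    observed  : (N, d) array, nonzero where the i-th measurement is non-missing
    is_parent : (N,) array, nonzero for parents, zero for sibs
    """
    observed = np.asarray(observed, dtype=bool)
    is_parent = np.asarray(is_parent, dtype=bool)
    n_total = observed.shape[0]                      # N, all individuals
    gamma_p = observed[is_parent].sum(axis=0) / n_total
    gamma_s = observed[~is_parent].sum(axis=0) / n_total
    return gamma_p, gamma_s

# Two parents and two sibs, d = 2 measurements (made-up pattern).
gamma_p, gamma_s = observed_fractions(
    observed=[[1, 0], [1, 1], [0, 1], [1, 1]],
    is_parent=[1, 1, 0, 0],
)
```

Both fractions are divided by the total number of individuals N rather than by the number of parents or sibs, following the definition in the text.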

For model (4), let

$H (θ | Y_{p}, Y_{j})= Δ_{p} ⊙ \log (\sum_{l =1}^{k} π_{l} f (Y_{p} | θ, l)) + (1 - Δ_{p}) ⊙ \log (\sum_{l =1}^{k} π_{l} S (Y_{p} | θ, l))$

$+ Δ_{j} ⊙ \log (\sum_{l =1}^{k} T (l) f (Y_{j} | G, θ, l)) + (1 - Δ_{j}) ⊙ \log (\sum_{l =1}^{k} T (l) S (Y_{j} | G, θ, l)) . (16)$

Here we use the notation Δ_p ⊙ v(⋅) to represent the marginal version of v(⋅) corresponding to the non-zero components of Δ_p. The Fisher information matrix is

$I (θ)= - E (\frac{\partial^{2} H (θ | Y_{p}, Y_{j})}{\partial θ^{'} \partial θ}).(17)$

The above expectation is more involved than it looks, since it involves summation over all possible combinations of non-zero elements of Δ_r, as well as the unknown distribution of Δ_r. Instead, we can use an empirical version, which has a known form,

$I_{N} (θ)= - \frac{1}{N} \frac{\partial^{2} \log L (y | θ)}{\partial θ^{'} \partial θ} |_{θ = {\hat{θ}}_{n}},(18)$

where L(y|θ) is given by (4). At the true data generating parameter θ₀, I_N is strongly consistent for I(⋅). To obtain the weight matrix, we need to specify the order of the parameters in θ. We arrange the first k-1 entries to be π_1,…,π_{k-1}; next come all the regression parameters for the first response variable, ..., all the regression parameters for the last response variable, and then all the independent parameters in the variance matrix S and covariance matrix Ω, in the same order. In the weight matrix W, for (i,j) corresponding to the first k-1 components of θ, the weight is γ_p=γ_1p+…+γ_dp; for (i,j) corresponding to the r-th and the l-th regression parameters, the weight is $\sqrt{(γ_{r p} + γ_{r s}) (γ_{l p} + γ_{l s})}$; for (i,j) corresponding to the (a,b)-th and the (u,v)-th variance or covariance, the weight is $[(γ_{a p} + γ_{a s}) (γ_{b p} + γ_{b s}) (γ_{u p} + γ_{u s}) (γ_{v p} + γ_{v s})]^{1/4}$. Let θ₀ be the true unknown data generating parameter, and let $\overset{d}{\to}$ stand for convergence in distribution. Then we have

$\sqrt{N} ({\hat{θ}}_{N} - θ_{0}) \overset{d}{\to} N (0, I^{- 1} (θ_{0}) \otimes W) (19)$

where A⊗B stands for the Kronecker product of matrices A and B. Since I(⋅) involves unknown quantities, in practice we use the equivalent approximation

$\sqrt{N} ({\hat{θ}}_{N} - θ_{0}) \sim N (0, I_{N}^{- 1} ({\hat{θ}}_{N}) \otimes W_{N}), (20)$

where W_N is W with each γ_ir replaced by its sample version γ_{N,ir}.
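The empirical information in (18) is a normalized negative Hessian of the log-likelihood at the MLE, so when analytic second derivatives are inconvenient it can be approximated numerically. The sketch below uses central finite differences. It is an illustration only: the function name `observed_information`, the step size `eps`, and the toy normal log-likelihood used in place of the log of (4) are assumptions of this example, and the weight matrix W_N is not applied here.

```python
import numpy as np

def observed_information(loglik, theta_hat, n_obs, eps=1e-4):
    """Approximate -(1/N) * Hessian of loglik at theta_hat by central differences."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = theta_hat.size
    hess = np.zeros((p, p))
    for a in range(p):
        ea = np.zeros(p)
        ea[a] = eps
        for b in range(p):
            eb = np.zeros(p)
            eb[b] = eps
            hess[a, b] = (
                loglik(theta_hat + ea + eb) - loglik(theta_hat + ea - eb)
                - loglik(theta_hat - ea + eb) + loglik(theta_hat - ea - eb)
            ) / (4.0 * eps ** 2)
    return -hess / n_obs

# Toy stand-in for log L(y | theta): IID normal data, theta = (mean, sd).
y = np.array([0.0, 2.0])

def toy_loglik(theta):
    mu, sigma = theta
    return (-y.size * np.log(sigma)
            - 0.5 * np.sum((y - mu) ** 2) / sigma ** 2
            - 0.5 * y.size * np.log(2.0 * np.pi))

theta_mle = np.array([y.mean(), y.std()])   # MLE of (mean, sd) for the toy model
info = observed_information(toy_loglik, theta_mle, n_obs=y.size)
```

For the toy model the result can be checked against the closed form diag(1/σ², 2/σ²) evaluated at the MLE; for the models in this paper, `loglik` would be replaced by the log of the relevant likelihood.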

The above idea applies to the other models in this paper, but the results are more involved, and we only discuss them briefly.

For the proportional hazards model, even for the IID data case, the conditional likelihood looks much different from the full likelihood. Interestingly, the MLE from this model (under the assumption of correct model specification and some regularity conditions) has the same asymptotic distribution as that from the full likelihood [21,52]. For the proportional hazards model, it is noted [4] that the survival function can be written as

$S (y | x, β)= S_{0} (y)^{h (β^{'} x)},$

and f(y|x,β)=-dS(y|x,β)/dy. So, to get the full log-likelihood (14), we need an estimate of S₀(y) for the d-dimensional case. A multi-dimensional generalization of the Kaplan-Meier nonparametric estimator of S(y) was proposed in [29,30], and a similar technique can be used here to construct S₀(y). Then (18) continues to hold in this case. Due to the technical involvement, we will not pursue the details here.
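For reference, the density implied by the proportional hazards form can be written out explicitly. Assuming S(y|x,β)=S₀(y)^{h(β'x)} with S₀ differentiable, and writing f₀(y)=-dS₀(y)/dy for the baseline density (a symbol introduced here only for this illustration), the chain rule gives

$f (y | x, β)= - \frac{d}{d y} S_{0} (y)^{h (β^{'} x)} = h (β^{'} x) S_{0} (y)^{h (β^{'} x) - 1} f_{0} (y),$

so an estimate of S₀(y), together with an estimate of its derivative, would in principle determine both the density and survival terms needed in the full log-likelihood.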

For the least squares estimator, since the weight involves the kernel and h_n, the treatment is different from those above. In the case of full observation, the asymptotic result generally has the form

$\sqrt{n h_{n}^{d}} ({\hat{θ}}_{n} - θ_{0} - h_{n}^{2} C - o_{p} (h_{n}^{2})) \overset{d}{\to} N (0, Ω),$

for some constant C and matrix Ω determined by the kernel and the true (unknown) data and censoring distributions [53,54]. In our case of partial observation, the above result holds with Ω replaced by Ω⊗W.

For the competing risks model, the structure is similar to that of the proportional hazards model. Here the response is one-dimensional, so the survival function can be estimated by the Kaplan-Meier estimator, and the weight matrix W is the identity.
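For the one-dimensional case, the Kaplan-Meier estimator is straightforward to compute. The following is a minimal sketch of the standard product-limit formula; the function name `kaplan_meier` and the coding of the event indicator (1 = failure observed, 0 = censored) are assumptions of this example, and it does not include the genetic mixture structure discussed above.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : (n,) observed times
    events : (n,) 1 if the failure was observed, 0 if censored (assumed coding)
    Returns the distinct failure times and the estimate just after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])

    surv = []
    s = 1.0
    for t in event_times:
        n_at_risk = np.sum(times >= t)                      # still under observation at t
        n_events = np.sum((times == t) & (events == 1))     # failures exactly at t
        s *= 1.0 - n_events / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Four individuals, the second one censored (made-up data).
km_times, km_surv = kaplan_meier(times=[1.0, 2.0, 3.0, 4.0], events=[1, 0, 1, 1])
```

Censored individuals contribute to the risk sets up to their censoring time but never trigger a drop in the estimate, which is how the estimator accommodates random censorship.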

Discussion

We have considered several statistical models, parametric, semi parametric and nonparametric, for the genetic regression analysis of familial multiple endpoints data with possibly missing records. Here we only considered nuclear families and parameters that are independent of time. The cases of arbitrary pedigrees and/or time-dependent parameters can be treated similarly. The variance components method can also be applied to the proportional hazards model and to the analysis of competing risks. There are some marginal models for multiple endpoints data that work well in practice, but we think the joint model is more appropriate when the structure among the responses is important in the analysis. Another commonly used method for dealing with missing data is the EM algorithm [55], which can be incorporated into the models considered here. For multiple endpoints data, however, the proportion of missing data is usually large, so the EM algorithm may not be efficient. When the missing pattern is non-ignorable, more complicated approaches need to be considered to reduce potential biases. Hypothesis testing for parameters of interest can be conducted using likelihood ratio statistics based on the parametric models. We have only derived the basic forms of these models; more features can be added to them in particular applications.