Finite Mixture Models and Their Applications: A Review

Hanze Zhang and Yangxin Huang*

Department of Epidemiology and Biostatistics, College of Public Health, University of South Florida, USA

*Corresponding author: Yangxin Huang, Department of Epidemiology and Biostatistics, College of Public Health, University of South Florida, Tampa, Florida 33612, USA.

Received: February 02, 2015; Accepted: March 18, 2015; Published: March 26, 2015

Abstract

Finite Mixture (FM) models have received increasing attention in recent years and have proven useful in modeling heterogeneous data with a finite number of unobserved sub-populations. They have been widely applied to classification, clustering, and pattern identification problems for independent data, and can also be used for longitudinal data to describe differences in trajectory among the subgroups. However, for computational convenience, most FM models are based on the normality assumption, which may be violated in real applications. Recently, FM models with non-normal distributions, such as the skew-normal and skew-t distributions, have received increasing attention and have shown advantages in modeling data with non-normality and heavy tails. One advantage of FM models is that both the maximum likelihood method and the Bayesian approach can be used to estimate model parameters and evaluate probabilities of subgroup membership simultaneously. We present a brief review of FM models for these two types of data under different scenarios.

Keywords: Finite mixture models; Heterogeneity; Longitudinal data; Nonnormal distributions

Introduction

Over 100 years ago, the famous biometrician Pearson [1] helped a colleague with a crab sample whose apparent skewness could not be accommodated adequately by a single symmetric normal distribution. Strongly suspecting that this population was evolving toward two new subspecies, he fitted a mixture of two normal probability density functions with different means and variances in two proportions. After Pearson first introduced the word “mixture” into statistics, not surprisingly, various attempts were made to dig deeper into this field.

Most statistical models assume that a sample of observations comes from a single distribution. Sometimes, however, this is not true, since the sample may be drawn from a number of distinct populations whose identities are not observed. When the homogeneity assumption is violated in this way, Finite Mixture (FM) models can come to the rescue. FM models provide a flexible framework for handling heterogeneous data with a finite number of unobserved sub-populations, and have been widely applied to classification, clustering, and pattern identification problems [2-5]. FM models have attracted considerable research interest and have been widely applied to independent data. Recently, the use of FM models for longitudinal data has also received increasing attention. This article provides a brief overview of FM models for these two types of data under different scenarios.

Models with normal mixture for independent data

An FM model is a convex combination of two or more probability density functions and can be formally written as a mixture with K component distributions:

$f (x) = \sum_{k = 1}^{K} w_{k} f_{k} (x)$ (1)

where $w_{k} > 0$ ($k = 1, 2, …, K$) are the mixing weights with $\sum_{k} w_{k} = 1$.

In modeling independent data, FM models allow for parameter differences across the unobserved classes. In other words, the component densities $f_{k}(x)$ in (1) all come from the same parametric family but have different parameters. Many distributions have been used as the parametric family of the components in the mixture model.
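As a minimal sketch of the density in (1), assuming univariate normal components, the mixture can be evaluated as a weighted sum of component densities. The function names and the numerical values below are hypothetical and chosen for illustration only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Evaluate f(x) = sum_k w_k f_k(x) for normal components, as in (1)."""
    # The mixing weights must be positive and sum to one.
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Hypothetical two-component example (values chosen for illustration only).
weights, means, sds = [0.3, 0.7], [0.0, 4.0], [1.0, 1.5]
density_at_one = mixture_pdf(1.0, weights, means, sds)
```

Because the weights sum to one and each component integrates to one, the mixture is itself a proper density.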

Owing to their computational convenience, normally distributed components have been widely used [6]. Such models can be fitted iteratively by Maximum Likelihood (ML) via the Expectation-Maximization (EM) algorithm [6-8]. Briefly, the EM algorithm consists of the following steps: 1) start with initial values for the parameters of the mixture components and the mixing weights $w_{1}, …, w_{K}$; 2) using the current parameter values, calculate the posterior component membership probabilities (weights) of each observation (E-step), then, using these weights, maximize the weighted likelihood to obtain new parameter estimates (M-step); 3) repeat step 2) until the algorithm converges, and then return the final parameter estimates and component membership probabilities. Several researchers have published programs for parameter estimation of FM models using the EM algorithm [9-12]. Additionally, FM models have been studied from a semiparametric perspective [13,14]. Owing to its flexibility, the FM model with normal components has been widely applied in different areas, including, but not limited to, medicine [15-17], genetics [18-21], public health [22], psychology [23,24], and economics [25,26].
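The EM steps described above can be sketched for a univariate normal mixture as follows. This is an illustrative implementation under simplifying assumptions (simulated data, crude initial values, no safeguards against numerical underflow), not the code of any of the cited programs; all function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def e_step(data, weights, means, sds):
    """Posterior membership probabilities and the observed-data log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(joint)  # mixture density at x; no underflow safeguard here
        loglik += math.log(total)
        resp.append([j / total for j in joint])
    return resp, loglik

def m_step(data, resp):
    """Weighted ML updates of the weights, means and standard deviations."""
    n, K = len(data), len(resp[0])
    weights, means, sds = [], [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        weights.append(nk / n)
        means.append(mean)
        sds.append(math.sqrt(max(var, 1e-12)))  # tiny floor keeps sd positive
    return weights, means, sds

def em_normal_mixture(data, weights, means, sds, max_iter=500, tol=1e-8):
    """Alternate M- and E-steps until the log-likelihood gain falls below tol."""
    resp, loglik = e_step(data, weights, means, sds)
    for _ in range(max_iter):
        weights, means, sds = m_step(data, resp)
        resp, new_loglik = e_step(data, weights, means, sds)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return weights, means, sds, resp, loglik

# Simulated data (illustration only): 300 draws from N(0, 1), 700 from N(5, 1).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(5.0, 1.0) for _ in range(700)])

# Step 1: crude initial values; steps 2 and 3 run inside em_normal_mixture.
w_hat, mu_hat, sd_hat, post, ll = em_normal_mixture(
    data, [0.5, 0.5], [min(data), max(data)], [1.0, 1.0])
```

The returned `post` holds, for each observation, the estimated probabilities of belonging to each component, which is the model-based classification that the text refers to.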

Models with normal mixture for longitudinal data

Although most FM models focus on independent data, mixtures have also been developed for modeling longitudinal data, where latent classes corresponding to the mixture components group individuals into clusters and provide a better fit to the data. These models aim at identifying multiple unobserved subgroups and describing differences in longitudinal change among them.

Generally, the density function of FM models for longitudinal data can be written as

$f (y_{i}) = \sum_{k = 1}^{K} w_{k} f_{k} (g_{k} (β_{i k}, x_{i k}); σ_{k}^{2} (β_{i k}, x_{i k}))$ (2)

where $y_{i}$ denotes a vector of repeated observations for subject i, which is assumed to come from the kth component; $f_{k} (g_{k} (β_{i k}, x_{i k}); σ_{k}^{2} (β_{i k}, x_{i k}))$ is the density function of the kth component with mean function $g_{k}(\cdot)$ and variance function $σ_{k}^{2}(\cdot)$; and $β_{i k}$ and $x_{i k}$ denote unknown subject-specific parameters and known covariates, respectively. As in (1), $w_{k} > 0$ are the mixing weights with $\sum_{k} w_{k} = 1$. For $g_{k}(\cdot)$, both linear (polynomial) and nonlinear mean functions can be used, but the former are more widely used, partly because inference can be conveniently carried out by the ML approach [27]. When formulating FM models for longitudinal data, the mean functions of the components can share a similar form with varying mean and/or variance specifications, or can have entirely different mean trajectories across the components [28].

FM models for longitudinal data, also known as growth mixture models, were presented by Verbeke [29] and Muthén [27]. A growth mixture model combines the random effects of mixed-effects models with finite mixtures. It allows the same mean function to take different sets of parameter values (growth factors) across components, thereby capturing latent trajectory classes with different curve shapes [30-32]. It can be considered an extension of the conventional Linear Mixed-Effects (LME) model with different latent classes of development. Both the EM algorithm [27] and Bayesian methods [33,34] have been used to estimate model parameters and subclass membership probabilities. Related developments, called latent class growth analysis [35-37], are special cases that assume no inter-individual differences in change within a class. In other words, all individuals in one trajectory class are assumed to behave the same, which allows more straightforward interpretation.
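To make the link between (2) and the LME model concrete, one common linear specification can be written as below. This is a sketch under stated assumptions rather than the exact formulation of the cited papers: the design matrices $X_{i}$ and $Z_{i}$, the random effects $b_{i}$, and the class-specific covariance matrices $D_{k}$ are symbols introduced here for illustration.

```latex
% Conditional on membership in latent class k
% (X_i, Z_i, b_i and D_k are assumptions of this sketch)
y_i = X_i \beta_k + Z_i b_i + \varepsilon_i, \qquad
b_i \sim N(0, D_k), \qquad \varepsilon_i \sim N(0, \sigma_k^2 I)

% Marginal mixture density implied by the class-specific model
f(y_i) = \sum_{k=1}^{K} w_k \,
  N\!\left(y_i;\; X_i \beta_k,\; Z_i D_k Z_i^{\top} + \sigma_k^2 I\right)

% Posterior probability that subject i belongs to class k,
% where f_k denotes the k-th normal density in the sum above
\tau_{ik} = \frac{w_k f_k(y_i)}{\sum_{j=1}^{K} w_j f_j(y_i)}
```

Under this sketch, latent class growth analysis corresponds to setting $D_{k} = 0$, so that subjects within a class share the same trajectory up to residual error.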

All of the mixtures above assume normally distributed variables within each latent class. Owing to the computational convenience of the normality assumption, many extensions and applications have been presented in different fields, such as medicine [38-40], psychology [41,42], social science [43-45], and pharmacokinetics/pharmacodynamics [46].

Models with non-normal mixture for independent data

In many real situations, however, the data contain longer-than-normal tails or atypical observations, and the use of normal components may affect the fit of the model and, in turn, lead to biased results. The FM model with t-distributed components has been considered as an alternative; it provides a more robust approach to fitting mixtures and yields less extreme estimates of the posterior probabilities of component membership [47-50]. Compared with the normal distribution, the t-distribution has an additional parameter, the degrees of freedom, which has been shown to accommodate outliers in data with heavy tails. The Expectation-Conditional Maximization (ECM) algorithm [47,48] and the Bayesian approach [51] have been used to fit FM models with t-distributions. In practice, FM models with t-distributions have been applied in a wide range of fields, including genetics [52,53], medicine [54-56], and engineering [57].
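The effect of heavy tails on posterior membership probabilities can be illustrated with a small numerical sketch. It assumes a hypothetical two-component mixture with equal weights, fixed component locations, unit scales, and 3 degrees of freedom for the t components; none of these values come from the cited studies.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, mean, scale, df):
    """Density of a location-scale Student t distribution with df degrees of freedom."""
    z = (x - mean) / scale
    log_norm = (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return math.exp(log_norm - 0.5 * (df + 1.0) * math.log1p(z * z / df))

def posterior_first(x, pdf_1, pdf_2, w_1=0.5):
    """Posterior probability that x belongs to component 1 of a two-component mixture."""
    a, b = w_1 * pdf_1(x), (1.0 - w_1) * pdf_2(x)
    return a / (a + b)

# Hypothetical setting: component 1 centred at 3, component 2 at 0, outlier at x = 8.
x_out = 8.0
post_normal = posterior_first(
    x_out,
    lambda x: normal_pdf(x, 3.0, 1.0),
    lambda x: normal_pdf(x, 0.0, 1.0))
post_t = posterior_first(
    x_out,
    lambda x: student_t_pdf(x, 3.0, 1.0, 3.0),
    lambda x: student_t_pdf(x, 0.0, 1.0, 3.0))
```

In this setting the normal mixture assigns the outlier to the nearer component with probability essentially equal to one, whereas the t mixture gives a noticeably less extreme probability.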

In addition to heavy tails, data in many applied problems are highly asymmetric. FM models with symmetric distributions, such as the normal and t-distributions, can be misleading when handling skewed data. Recently, asymmetric distribution-based mixture models, particularly the Skew-Normal (SN) [58-63] and Skew-t (ST) mixture models [62,64-67], have received increasing attention and have been developed as important extensions of traditional models with symmetric distributions for modeling data with asymmetry, heavy tails, and outliers.

Compared with normal mixtures, FM models with SN distributions can provide more appropriate density estimates for asymmetric observations through an additional shape/skewness parameter. Model fitting can be conducted by both the EM algorithm [58,59] and the Bayesian approach using Markov Chain Monte Carlo (MCMC) methods [58,62]. Their flexibility and robustness against skewness have been demonstrated on real data, such as genetic data [68], transportation data [69], and environmental data [70].
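The role of the shape/skewness parameter can be sketched with a univariate skew-normal density. The code below assumes the Azzalini-type parameterization, which the text does not specify and which may differ from those used in the cited papers; the function names and the shape value are hypothetical.

```python
import math

def skew_normal_pdf(x, location, scale, shape):
    """Azzalini-type skew-normal density: (2 / scale) * phi(z) * Phi(shape * z)."""
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi / scale

def numerical_mean(pdf, lower=-15.0, upper=15.0, step=0.001):
    """Approximate the mean of a density with a simple Riemann sum."""
    n = int(round((upper - lower) / step))
    return sum((lower + i * step) * pdf(lower + i * step) for i in range(n + 1)) * step

shape = 3.0  # hypothetical value; a positive shape gives right skew
sn_mean = numerical_mean(lambda x: skew_normal_pdf(x, 0.0, 1.0, shape))
# Known mean of the standard skew-normal: delta * sqrt(2 / pi), delta = shape / sqrt(1 + shape^2)
theoretical_mean = shape / math.sqrt(1.0 + shape ** 2) * math.sqrt(2.0 / math.pi)
```

Setting `shape` to zero recovers the normal density, which is why the normal mixture is a special case of the SN mixture; using `skew_normal_pdf` as $f_{k}$ in (1) gives an SN mixture density.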

As a natural extension of the Student-t and SN mixtures, the FM model with ST distributions has shown advantages in modeling data with both asymmetry and heavy tails. The ST distribution has both a degrees-of-freedom parameter and a shape/skewness parameter, whereas the Student-t distribution has only the former and the SN distribution only the latter. Therefore, FM models with normal, Student-t, and SN distributions can be viewed statistically as special cases of ST mixture models. Lee and McLachlan [71-73] suggested that the existing ST distributions can be classified into four forms: restricted, unrestricted, extended, and generalized. The EM algorithm has been used to fit mixtures of both restricted and unrestricted ST distributions [65,71,74]. The unrestricted ST mixture model has a more general characterization than the restricted ST mixture model, and hence is able to capture asymmetric behavior across components with greater flexibility [71]. A Bayesian approach implemented via an MCMC scheme can also be used for efficient inference in FM models with ST distributions [62]. Applications can be found in various areas, including biology [66,75], bioinformatics [76], transportation [69], and astrophysics [77].

Besides the distributions above, which are widely used in FM models, some alternative non-normal distributions have also received attention, including the normal inverse Gaussian distribution [78,79], the skew t-normal distribution [80], the Shifted Asymmetric Laplace (SAL) distribution [81], and generalized hyperbolic distributions [82]. Franczak [81] reported that SAL mixture models offered near-perfect results on the data considered, whereas mixture models with normal distributions consistently overestimated the number of components.

Models with non-normal mixture for longitudinal data

As with the FM model for independent data discussed above, when the repeated observations $y_{i}$ in (2) are truly non-normally distributed, a model based on the normality assumption is not robust and can lead to poor estimation and inference [83]. In this case, non-normal FM models for longitudinal data should be considered, because they fit the data better than normal mixtures. Although most of the non-normal distributions used in FM models for independent data, such as the SN [84] and ST distributions [85], can be applied in the longitudinal case, the ST distribution, with its skewness and degrees-of-freedom parameters, has been the most widely used. Either the random effects or the residuals of the model can be assumed to follow an ST distribution. Recently, for example, Muthén [85] introduced a new growth mixture model with ST-distributed random effects.

In addition to FM models with linear (polynomial) or piecewise linear mean functions, mixture models with different nonlinear mean components have received increasing attention. For instance, to explicitly estimate HIV viral load trajectories, Huang et al. [86] constructed three different mean functions with ST distributions for three potential subgroups: a one-compartment model with a constant decay rate, a two-compartment model with constant decay rates, and a two-compartment model with constant and time-varying decay rates. They made inference for the ST-FM models from a Bayesian perspective. Furthermore, in addition to non-normality, Huang et al. extended FM models by simultaneously considering other features of longitudinal data, including measurement errors in covariates [87-89], non-ignorable missing data mechanisms [87,89-91], left-censored responses [92], and time-to-event outcomes [93].

Discussion

In recent decades, the FM model has proved to be one of the most powerful model-based approaches for dealing with data in the presence of population heterogeneity. This heterogeneity can often be detected by visual methods, such as scatter plots and histograms. For instance, a bimodal or multimodal distribution for independent data, or distinct trajectories for longitudinal data, strongly suggests the existence of heterogeneity or subgroups. FM models handle this data feature not only by providing model parameter estimates, but also by allowing model-based probabilistic clustering to obtain class membership probabilities. Recent developments and extensions of FM models offer increasing ability and flexibility in capturing independent or longitudinal data with different features, which can benefit applications in various scientific areas.

Selecting the optimal number of mixture components is an important but difficult problem in FM models. Since the conventional likelihood-ratio test comparing FM models with k and k+1 components is not appropriate, the adjusted Lo-Mendell-Rubin Likelihood-Ratio Test (adjusted LRT) has gained acceptance for selecting the model with the optimal number of components [94]. An alternative approach is to compare information criteria, such as Akaike’s Information Criterion (AIC) [95], the Bayesian Information Criterion (BIC) [96], and the Sample-Size Adjusted BIC (SSABIC) [97]. However, most of these criteria are very sensitive to sample size and favor highly parameterized models. Thus, it is suggested that these information criteria be considered together with other evidence. Additionally, entropy has been considered as a criterion for selecting the number of components. Entropy assesses whether each subject is classified neatly into one and only one subgroup, with higher values (> 0.80) indicating better classification [98]. As this issue has not been completely resolved, it is advisable to apply several criteria simultaneously to determine the optimal number of components for FM models.
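The criteria mentioned above can be computed from a fitted model's log-likelihood and posterior membership probabilities. The sketch below uses their commonly stated forms, which the text does not print and which are therefore assumptions here: AIC = -2 log L + 2p, BIC = -2 log L + p ln n, SSABIC with n replaced by (n + 2)/24, and the relative entropy 1 - Σ_i Σ_k (-p_ik ln p_ik) / (n ln K). Function names and input values are hypothetical.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC and sample-size adjusted BIC; smaller values indicate a better trade-off."""
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    ssabic = -2.0 * loglik + n_params * math.log((n_obs + 2.0) / 24.0)
    return {"AIC": aic, "BIC": bic, "SSABIC": ssabic}

def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; values near 1 indicate clear classification."""
    n, K = len(posteriors), len(posteriors[0])
    if K < 2:
        raise ValueError("relative entropy needs at least two components")
    # Terms with p = 0 contribute nothing, since p * ln(p) tends to 0.
    total = sum(-p * math.log(p) for row in posteriors for p in row if p > 0.0)
    return 1.0 - total / (n * math.log(K))

# Hypothetical fitted model: log-likelihood -100 with 5 parameters and 200 subjects.
example_ic = information_criteria(-100.0, 5, 200)
# Hypothetical posterior membership probabilities for three subjects, two classes.
example_entropy = relative_entropy([[0.95, 0.05], [0.10, 0.90], [0.99, 0.01]])
```

In practice these quantities would be computed for each candidate number of components and compared alongside the adjusted LRT.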

As a constrained exploratory technique, the FM model seeks the patterns that the data reveal, but what can be learned is limited by what is entered. In other words, the final model is the best representation of the data given the model specifications made before estimation. Whether the estimated subgroups represent the true heterogeneous patterns is unknown. Thus, we suggest that researchers obtain further evidence that the unobserved subgroups really exist by replicating findings with another data set and by identifying associations between subgroup membership and other measured variables.

Initial value selection and convergence issues often arise in model estimation via the EM or ECM algorithm for computationally intensive FM models. With general forms of skew distributions, closed forms for the conditional expectations involved in the E-step of the EM algorithm may not be available. Starting with different sets of initial values is strongly recommended, as it helps determine whether they all lead to the same solution. Non-normal mixtures need more random initial values than normal mixtures to replicate the best log-likelihood, because their likelihood functions are typically less smooth. To avoid these problems with the EM or ECM algorithms, the Bayesian approach with MCMC techniques, which has attracted attention in this field, can come to the rescue.
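The recommendation to use several sets of initial values can be sketched as a multi-start wrapper around EM. The example assumes a univariate normal mixture, simulated data, starting means drawn at random from the observations, a fixed number of iterations instead of a convergence test, and an ad hoc lower bound on the standard deviations; all names and settings are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_em(data, means, sd, n_iter=200, min_sd=0.1):
    """Compact EM for a normal mixture; returns (log-likelihood, weights, means, sds)."""
    K = len(means)
    weights, means, sds = [1.0 / K] * K, list(means), [sd] * K
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every observation.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted updates; min_sd is an ad hoc floor against degenerate fits.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
    loglik = sum(math.log(sum(w * normal_pdf(x, m, s)
                              for w, m, s in zip(weights, means, sds)))
                 for x in data)
    return loglik, weights, means, sds

def multi_start(data, n_components, n_starts, seed):
    """Run EM from several random starts; return all fits sorted by log-likelihood."""
    rng = random.Random(seed)
    mean = sum(data) / len(data)
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    fits = [fit_em(data, rng.sample(data, n_components), sd) for _ in range(n_starts)]
    return sorted(fits, key=lambda fit: fit[0], reverse=True)

# Simulated data (illustration only): 120 draws from N(0, 1), 180 from N(5, 1).
data_rng = random.Random(1)
data = ([data_rng.gauss(0.0, 1.0) for _ in range(120)]
        + [data_rng.gauss(5.0, 1.0) for _ in range(180)])

fits = multi_start(data, n_components=2, n_starts=10, seed=2)
best_loglik, best_weights, best_means, best_sds = fits[0]
# Number of starts that reproduce the best log-likelihood (within a small tolerance).
n_replicated = sum(1 for fit in fits if best_loglik - fit[0] < 1e-3)
```

A large `n_replicated` relative to the number of starts suggests that the best solution is stable; a small value suggests that more starts are needed.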

Other cautions regarding FM models should also be noted. First, the computational load of complicated FM models, especially mixtures with non-normal distributions for longitudinal data, is extremely heavy. Second, for inference in FM models, parameter (or model) identifiability can be a critical but difficult problem when a large number of model parameters must be estimated simultaneously. Each component model must be identifiable for the whole mixture model to be identifiable. If the model is not properly identified, many different sets of parameter estimates may be obtained. Moreover, model comparison and goodness-of-fit tests need to be further developed, focusing not only on differences in the number of latent classes, but also on their random-effects specifications. Finally, FM modeling is a statistical procedure that usually requires a large sample size.

In summary, the FM model is a fast-developing statistical approach for modeling independent or longitudinal data with heterogeneity. This article provides a brief, up-to-date overview of developments in FM models for both types of data. Compared with those for independent data, studies of FM models for complicated longitudinal data are still relatively limited, and few include time-varying predictors, but we believe that more important and interesting results in this area will be reported in the near future.

A final note concerns software for implementing FM models. The most widely used software packages for FM models are EMMIX [99] and Mplus [100]. Other available software designed for specific situations includes, but is not limited to, AUTOCLASS [101], NORMIX [102], and MIX [103]. Several R packages are also available for fitting FM models, including ‘mclust’ [104], ‘mixtools’ [105], and ‘FlexMix’ [106]. When the mean functions of the components are very complicated, especially for longitudinal data with non-normal distributions, which bring an extremely heavy computational load, the Bayesian method shows its advantages. The WinBUGS software [107], interfaced with R through the package ‘R2WinBUGS’, is a good choice.