Simulating Clustered and Dependent Binary Variables

Special Article - Biostatistics Theory and Methods

Austin Biom and Biostat.2015;2(2): 1020.

Aobo Wang and Roy T Sabo*

Department of Biostatistics, Virginia Commonwealth University, USA

*Corresponding author: Sabo RT, Department of Biostatistics, Virginia Commonwealth University, 830 East Main Street, Richmond, VA 23298-0032, USA.

Received: June 01, 2015; Accepted: June 11, 2015; Published: June 19, 2015


Abstract

Dependent binary data can be simply simulated using the multivariate normal- and multinomial sampling-based approaches. We extend these methods to simulate dependent binary data with clustered random effect structures. Several distributions are considered for constructing random effects among cluster-specific parameters and effect sizes, including the normal, uniform and beta distributions. We present results from simulation studies to show proof of concept for these two methods in creating data sets of repeated-measure binary outcomes with clustered random effect structures in various scenarios. The simulation studies show that multivariate normal- and multinomial sampling approaches can be successfully adapted to simulate dependent binary data with desired random effect structures.

Keywords: Dependent binary data; Clustered random effect; Simulated data


Abbreviations

MVN: Multivariate Normal; CDF: Cumulative Distribution Function; PDF: Probability Density Function; MS: Multinomial Sampling


Introduction

Methods for simulating dependent binary outcomes are often required for the assessment of statistical methodologies suitable for repeated-measures study designs with dichotomous outcomes. Such simulation techniques can also be useful in determining required sample sizes for longitudinal study designs featuring binary measurements. Emrich and Piedmonte [1] developed a gold-standard method for simulating dependent binary outcomes based on the multivariate normal distribution. Haynes et al. [2] introduced an approach based on the multinomial distribution of all possible combinations of the binary outcomes. Both of these approaches were extended to account for modeling dependencies with odds ratios in Sabo et al. [3].

While useful for repeated-measures or multiple-outcome studies, these methods require expansion if they are to be used in more complicated situations. For instance, certain research studies feature inherent clustering, where subjects exist in natural clusters or groups. Examples include studies of school-age children attending various classrooms or schools [4], or primary care patients who attend one of several primary care facilities [5], the latter of which also features patients nested within primary care physicians, who are in turn nested within primary care practices that are nested within larger health care systems. The previously mentioned simulation approaches cannot incorporate this type of complexity without amendment and are unsuitable as currently constructed to simulate clustered repeated-measures data that would mimic such a scenario.

In this manuscript, we extend the multivariate normal- and multinomial sampling-based approaches for simulating dependent binary outcomes to also incorporate a desired cluster structure. This extension requires probabilistically generating parametric simulation templates for each of the desired cluster levels or combinations. Several simple probability distributions are used to exemplify the process of establishing the cluster-specific parameters and effect sizes, including the normal, uniform and beta distributions. The rest of this manuscript is outlined as follows. The two simulation methods are briefly described in the next section, and are extended to account for a desired cluster structure. The performance of these extensions is then examined through simulation studies. A brief discussion concludes the manuscript.

Materials and Methods

Simulation methodologies: Multivariate normal approach

The simulation approach by Emrich and Piedmonte [1] utilizes the multivariate normal distribution to generate vectors exhibiting desired dependence levels, which are then categorized into binary observations. The process begins by using the desired pairwise correlations $\delta_{ij}$ between binary measures $Y_i$ and $Y_j$ with marginal probabilities $p_i$ and $p_j$ to solve for a bivariate correlation $\rho_{ij}$ using the bivariate normal Cumulative Distribution Function (CDF) $\Phi$,

$$\Phi\left[z(p_i), z(p_j), \rho_{ij}\right] = \delta_{ij}\left(p_i q_i p_j q_j\right)^{1/2} + p_i p_j, \qquad (1)$$

where $z(p)$ is the $p$-th percentile of the standard normal distribution and $q = 1 - p$. Odds ratios could be used in place of correlations by replacing the right-hand side of Equation (1) with the Plackett copula [6], where $\psi_{ij}$ is the desired odds ratio, as shown in Sabo et al. [3]. The resulting bivariate correlations $\rho_{ij}$ are then placed into a correlation matrix $\Sigma$ and used to simulate a multivariate normal vector $Z$. Binary observations are then created by classifying each element $Z_i$ of $Z$ by letting $Y_i = 1$ if $Z_i \le z(p_i)$ and $Y_i = 0$ otherwise. This process can be repeated by generating and classifying $n$ such vectors to create the desired simulated sample.
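The steps of the multivariate normal approach can be sketched for a single pair of binary measures as follows. This is a minimal illustration rather than the implementation used by the authors: the function names (`bvn_cdf`, `solve_rho`, `simulate_pair`) are hypothetical, the bivariate normal CDF is evaluated with a simple Simpson's-rule integration of the standard identity $\Phi(a, b; \rho) = \Phi(a)\Phi(b) + \int_0^{\rho} \phi(a, b; r)\,dr$, and the bivariate correlation is found by bisection, which assumes the target in Equation (1) is attainable for the chosen inputs.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bvn_cdf(a, b, rho, steps=200):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho.

    Uses Phi2(a, b; rho) = Phi(a) * Phi(b) + integral from 0 to rho of the
    bivariate normal density at (a, b), approximated with Simpson's rule.
    """
    def density(r):
        s = 1.0 - r * r
        quad = a * a - 2.0 * r * a * b + b * b
        return math.exp(-quad / (2.0 * s)) / (2.0 * math.pi * math.sqrt(s))

    h = rho / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(rho)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * density(k * h)
    return STD_NORMAL.cdf(a) * STD_NORMAL.cdf(b) + total * h / 3.0


def solve_rho(p_i, p_j, delta, tol=1e-10):
    """Solve Equation (1) for the bivariate normal correlation by bisection."""
    target = delta * math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j)) + p_i * p_j
    a, b = STD_NORMAL.inv_cdf(p_i), STD_NORMAL.inv_cdf(p_j)
    lo, hi = -0.999, 0.999  # assumed search range; the CDF increases with rho
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bvn_cdf(a, b, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_pair(p_i, p_j, delta, n, seed=None):
    """Simulate n pairs (Y_i, Y_j) with the desired margins and correlation."""
    rng = random.Random(seed)
    rho = solve_rho(p_i, p_j, delta)
    a, b = STD_NORMAL.inv_cdf(p_i), STD_NORMAL.inv_cdf(p_j)
    sample = []
    for _ in range(n):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2  # corr(z1, z2) = rho
        # Classify: Y = 1 if Z <= z(p), and Y = 0 otherwise
        sample.append((int(z1 <= a), int(z2 <= b)))
    return sample
```

For example, `simulate_pair(0.3, 0.5, 0.2, n=1000, seed=1)` returns 1000 pairs whose sample proportions and correlation should be close to the requested values. For three or more measures, the same pairwise solve would be applied to every pair and the vector drawn from the full correlation matrix, which in practice is more convenient with a numerical library.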

Simulation methodologies: Multinomial approach

The multinomial-based simulation method introduced by Haynes et al. [2] uses a multinomial distribution of all possible combinations of dependent binary outcomes, which can be created through the joint and marginal probabilities, along with the desired correlation. Given a desired correlation $\delta_{ij}$ between binary variables $Y_i$ and $Y_j$ with desired marginal probabilities $p_i$ and $p_j$, we first calculate the joint probability $p_{ij} = P(Y_i = 1, Y_j = 1)$ using the following expression, where $q = 1 - p$.

$$p_{ij} = p_i p_j + \delta_{ij}\left(p_i q_i p_j q_j\right)^{1/2} \qquad (2)$$

Note that if odds ratios are used instead of correlations, then the joint probability $p_{ij}$ can be solved for by inserting the desired odds ratio $\psi_{ij}$ and marginal probabilities $p_i$ and $p_j$ into the Plackett copula, as described in Sabo et al. [3]. Whether correlations or odds ratios are used to model dependence, the remainder of the multinomial-based approach is identical after the pairwise joint probabilities are calculated.
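The two routes to the pairwise joint probability can be illustrated with a short sketch. The function names below are hypothetical. The odds-ratio version uses the standard closed form of the Plackett copula, which is the root of the quadratic obtained by setting the odds ratio of the 2 x 2 table equal to the desired value; it is offered as an illustration and is not taken from Sabo et al. [3].

```python
import math


def joint_from_correlation(p_i, p_j, delta):
    """Joint probability P(Y_i = 1, Y_j = 1) from a desired correlation."""
    return p_i * p_j + delta * math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))


def joint_from_odds_ratio(p_i, p_j, psi):
    """Joint probability P(Y_i = 1, Y_j = 1) from a desired odds ratio psi,
    using the closed form of the Plackett copula."""
    if abs(psi - 1.0) < 1e-12:
        return p_i * p_j  # an odds ratio of 1 corresponds to independence
    s = 1.0 + (psi - 1.0) * (p_i + p_j)
    disc = s * s - 4.0 * psi * (psi - 1.0) * p_i * p_j
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))


def implied_odds_ratio(p_i, p_j, p_ij):
    """Odds ratio of the 2 x 2 table implied by the margins and joint probability."""
    p11 = p_ij
    p10 = p_i - p_ij
    p01 = p_j - p_ij
    p00 = 1.0 - p_i - p_j + p_ij
    return (p11 * p00) / (p10 * p01)
```

As a check on the sketch, feeding the result of `joint_from_odds_ratio` back into `implied_odds_ratio` with the same margins should recover the requested odds ratio up to rounding error.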

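If three or more dependent binary measures are to be simulated, then higher order joint probabilities must be calculated. Let $p_{ijk} = P(Y_i = 1, Y_j = 1, Y_k = 1)$ represent the three-way joint probability, which is not uniquely defined by the marginal probabilities and the correlations. As shown in Chaganty and Joe [7], the minimum and maximum of $p_{ijk}$ are defined as follows,

$$p_{ijk}^{\min} = \max\left\{0,\; p_{ij} + p_{ik} - p_i,\; p_{ij} + p_{jk} - p_j,\; p_{ik} + p_{jk} - p_k\right\},$$
$$p_{ijk}^{\max} = \min\left\{p_{ij},\; p_{ik},\; p_{jk},\; 1 - p_i - p_j - p_k + p_{ij} + p_{ik} + p_{jk}\right\}, \qquad (3)$$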

where any value $p_{ijk} \in \left[p_{ijk}^{\min}, p_{ijk}^{\max}\right]$ leads to a valid probability density function with the desired marginal probabilities and dependence level. Though any value in this range is appropriate, we take the midpoint of the minimum and maximum. Higher order joint probabilities in cases of four or more dependent binary observations can be determined in a similar manner, though the calculations become more tedious as the number of observations increases.
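The bounds and the midpoint choice for the three-way joint probability can be sketched as below. The function names are hypothetical, and the eight cell probabilities are obtained by inclusion-exclusion from the margins, the pairwise joint probabilities and the chosen three-way value; the sketch assumes the pairwise inputs are mutually compatible and raises an error when the range is empty.

```python
def three_way_bounds(p_i, p_j, p_k, p_ij, p_ik, p_jk):
    """Minimum and maximum of P(Y_i = 1, Y_j = 1, Y_k = 1) given the
    marginal and pairwise joint probabilities."""
    lower = max(0.0,
                p_ij + p_ik - p_i,
                p_ij + p_jk - p_j,
                p_ik + p_jk - p_k)
    upper = min(p_ij, p_ik, p_jk,
                1.0 - p_i - p_j - p_k + p_ij + p_ik + p_jk)
    if lower > upper:
        raise ValueError("no valid three-way joint probability for these inputs")
    return lower, upper


def three_way_midpoint(p_i, p_j, p_k, p_ij, p_ik, p_jk):
    """Midpoint of the admissible range, as chosen in the text."""
    lower, upper = three_way_bounds(p_i, p_j, p_k, p_ij, p_ik, p_jk)
    return (lower + upper) / 2.0


def three_variable_pmf(p_i, p_j, p_k, p_ij, p_ik, p_jk, p_ijk):
    """Probabilities of all eight outcomes; each key lists Y_i, Y_j, Y_k."""
    return {
        "111": p_ijk,
        "110": p_ij - p_ijk,
        "101": p_ik - p_ijk,
        "011": p_jk - p_ijk,
        "100": p_i - p_ij - p_ik + p_ijk,
        "010": p_j - p_ij - p_jk + p_ijk,
        "001": p_k - p_ik - p_jk + p_ijk,
        "000": 1.0 - p_i - p_j - p_k + p_ij + p_ik + p_jk - p_ijk,
    }
```

With all marginal probabilities equal to 0.5 and all pairwise joint probabilities equal to 0.3, the admissible range is approximately [0.1, 0.3], and the midpoint of 0.2 yields eight non-negative cell probabilities that sum to one.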

These quantities are used to calculate the multinomial Probability Density Function (PDF) of all combinations of outcomes, which for the two-variable case are shown in the first two columns of Table 1. The CDF is created by progressively summing the values of the PDF, where the subscripts on $p$ indicate whether each binary outcome is successful, with 1 for success and 0 for failure. For example, $p_{10}$ is the probability that the first outcome is a success and the second is a failure. After the CDF is determined, a random number $u$ is simulated from the uniform distribution on (0, 1), and the simulated observations are generated using decision rules based on the CDF, as shown in the last two columns of Table 1. For example, if $u$ falls in the CDF interval assigned to the outcome 10, then the observation is recorded as $Y_1 = 1$ and $Y_2 = 0$, or simply as 10. This process can be repeated to generate a sample of dependent binary outcomes. A similar approach, outlined in Haynes et al. [2], can be used in cases of three or more dependent binary outcomes.
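The multinomial sampling step for two binary outcomes can be sketched as follows. The function names are hypothetical, and the ordering of the outcomes (11, 10, 01, 00) used to build the CDF is an assumption made for illustration; it may differ from the ordering in Table 1, although any fixed ordering produces the same distribution of simulated outcomes.

```python
import random
from itertools import accumulate


def two_variable_pmf(p_1, p_2, p_12):
    """PDF over the four outcomes; each key lists Y_1 then Y_2."""
    pmf = {
        "11": p_12,
        "10": p_1 - p_12,
        "01": p_2 - p_12,
        "00": 1.0 - p_1 - p_2 + p_12,
    }
    if any(v < 0.0 for v in pmf.values()):
        raise ValueError("joint probability is incompatible with the margins")
    return pmf


def sample_outcomes(pmf, n, seed=None):
    """Draw n outcomes by comparing uniform random numbers with the CDF."""
    rng = random.Random(seed)
    labels = list(pmf)  # assumed outcome ordering: insertion order of pmf
    cdf = list(accumulate(pmf[label] for label in labels))
    draws = []
    for _ in range(n):
        u = rng.random()
        for label, upper in zip(labels, cdf):
            if u <= upper:  # first CDF value that reaches u decides the outcome
                draws.append(label)
                break
        else:
            draws.append(labels[-1])  # guard against rounding in the last CDF value
    return draws
```

For instance, `sample_outcomes(two_variable_pmf(0.4, 0.5, 0.3), n=1000, seed=7)` returns a list of strings such as "10", where the first character is the simulated value of the first outcome and the second character is the simulated value of the second.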