Special Article - Biostatistics Theory and Methods
Austin Biom and Biostat.2015;2(2): 1020.
Simulating Clustered and Dependent Binary Variables
Aobo Wang and Roy T Sabo*
Department of Biostatistics, Virginia Commonwealth University, USA
*Corresponding author: Sabo RT, Department of Biostatistics, Virginia Commonwealth University, 830 East Main Street, Richmond, VA 23298-0032, USA.
Received: June 01, 2015; Accepted: June 11, 2015; Published: June 19, 2015
Abstract
Dependent binary data can be simply simulated using the multivariate normal- and multinomial sampling-based approaches. We extend these methods to simulate dependent binary data with clustered random effect structures. Several distributions are considered for constructing random effects among cluster-specific parameters and effect sizes, including the normal, uniform and beta distributions. We present results from simulation studies to show proof of concept for these two methods in creating data sets of repeated-measure binary outcomes with clustered random effect structures in various scenarios. The simulation studies show that multivariate normal- and multinomial sampling approaches can be successfully adapted to simulate dependent binary data with desired random effect structures.
Keywords: Dependent binary data; Clustered random effect; Simulated data
Abbreviations
MVN: Multivariate Normal; CDF: Cumulative Distribution Function; PDF: Probability Density Function; MS: Multinomial Sampling
Introduction
Methods for simulating dependent binary outcomes are often required for the assessment of statistical methodologies suitable for repeated measure study designs with dichotomous outcomes. Such simulation techniques can also be useful in determining required sample sizes for longitudinal study designs featuring binary measurements. Emrich and Piedmonte [1] developed a goldstandard method for simulating dependent binary outcomes based on the multivariate normal distribution. Haynes et al. [2] introduced an approach based on the multinominal distribution of all possible combinations of the binary outcomes. Both of these approaches were extended to account for modeling dependencies with odds ratios in Sabo et al. [3].
While useful for repeated-measures or multiple-outcome studies, these methods require expansion if they are to be used in more complicated situations. For instance, certain research studies feature inherent clustering, where groups of subjects exist in natural clusters or groups. Examples include studies of school-age children attending various class rooms or schools [4], or primary care patients who attend one of several primary care facilities [5], the latter of which also features patients nested within primary care physicians, who are in turn nested within primary care practices that are nested within larger health care systems. The previously mentioned simulation approaches cannot incorporate this type of complexity without amendment and are unsuitable as currently constructed to simulate clustered repeated measure data that would mimic such a scenario.
In this manuscript, we extend the multivariate normal- and multinomial sampling-based approaches for simulating dependent binary outcomes to also incorporate a desired cluster structure. This extension requires probabilistically generating parametric simulation templates for each of the desired cluster levels or combinations. Several simple probability distributions are used to exemplify the process of establishing the cluster-specific parameters and effect sizes, including the normal, uniform and beta distributions. The rest of this manuscript is outlined as follows. The two simulation methods are briefly described in the next Section, and are extended to account for a desired cluster structure. The performances of these extensions are then examined through simulation studies. A brief discussion concludes the manuscript.
Materials and Methods
Simulation methodologies: Multivariate normal approach
The simulation approach by Emrich and Piedmonte [1] utilizes the multivariate normal distribution to generate vectors exhibiting desired dependence levels, which are then categorized into binary observations. The process begins by using the desired pair wise correlations between binary measures and with marginal probabilities and to solve for a bivariate correlation using the bivariate normal Cumulative Distribution Function (CDF), (1)
where is the percentile of the standard normal distribution and q = 1 - p. Odds ratios could be used in place of correlations by replacing the right-hand side of Equation (1) with the Plackett copula [6], where is the desired odds ratio, as shown in Sabo et al. [3]. The values are then placed into a correlation matrix and used to simulate a multivariate normal vector. Binary observations are then created by classifying each element of by letting if and otherwise. This process can be repeated by generating and classifying such vectors to create the desired simulated sample.
Simulation methodologies: Multinomial approach
The multinomial-based simulation method introduced by Haynes et al. [2] uses a multinomial distribution of all possible combinations of dependent binary outcomes, which can be created through the joint and marginal probabilities, along with the desired correlation. Given a desired correlation between binary variables and with desired marginal probabilities and, we first calculate the joint probability using the following expression. (2)
Note that if odds ratios are used instead of correlations, then can be solved for by inserting the desired odds ratio and marginal probabilities and into the Plackett copula, as described in Sabo et al. [3]. Note that whether correlations or odds ratios are used to model dependence, the remainder of the multinomial-based approach is identical after the pair-wise joint probabilities are calculated.
If three or more dependent binary measures are to be simulated, then higher order joint probabilities must be calculated. Let represent the joint probability, which is not uniquely defined by the marginal probabilities and the correlation. As shown in Chaganty and Joe [7], the minimum and maximum are defined as follows, (3)
where any value leads to a valid probability density function with the desired marginal probabilities and dependence level. Though any value in this range is appropriate, we take the midpoint. Higher order joint probabilities in cases of four or more dependent binary observations can be determined in a similar manner, though the calculations become more tedious as the number of observations increases.
These quantities are used to calculate the multinomial Probability Density Function (PDF) of all combinations of outcomes, which for the two-variable case are shown in the first two columns of Table 1. The CDF is created by progressively summing the values of the PDF, where the subscripts on indicate whether each binary outcome is successful, with 1 for success and 0 for failure. For example, After the CDF is determined, a random number is simulated, and the simulated observations are generated based on the decision rules based on the CDF, as shown in the last two columns of Table 1. For example, if, then the observation is recorded as and, or simply as 10. This process can be repeated to generate a sample of dependent binary outcomes. A similar approach – outlined in Haynes et al. [2] – can be used in cases of three or more dependent binary outcomes.
CDF
Decision Rule and Simulated Outcome
11
10
01
00
Table 1: Two-variable PDF, CDF and decision rules for multinomial approach.
Accounting for random effects by generating the simulation templates
For the two simulation approaches discussed in Section 2, we simulate a set of binary data representative of a single population by repeating either process times using a single simulation template, which consists of all desired marginal probabilities and pairwise correlations (or odds ratios ). To generate two or more groups of simulated binary observations, where groups are differentiated by either different marginal probabilities, dependencies, or both, the simulation approach is repeated separately for each group with the desired simulation template.
Expanding the multivariate- and multinomial-based approaches to account for random effects (say from clustering) requires only a simple extension of the process used when simulating binary observations for multiple groups. As a motivating example, let’s assume we want to simulate clusters of samples of two correlated binary measures. Let’s further assume that those two measures have marginal probabilities that vary across the clusters in such a way that the averages are and the corresponding cluster variances for those rates are and.
First we assume that the marginal probability for each binary measure has some probability distribution, where is some probability mass or density function and is some parameter (possibly vectorvalued) selected such that and for group. If we desire clusters of simulated observations, then we simulate marginal probabilities and from and, respectively, for. For cluster, we simulate the desired number of dependent binary observations using and and the desired dependence level (or). This process is repeated for, and the resulting clusters of simulated data will on average exhibit a distribution of marginal probabilities centered around and , though the clusterspecific marginal means will vary according to and, thus achieving the desired level of clustering.
In the previous scenario, the marginal probabilities were given probability distributions and themselves simulated times to achieve a clustering effect. An equivalent approach would be to simulate probabilistically to achieve a desired mean and variance for the first marginal mean across clusters, and then simulate some and define, where is some probability distribution not necessarily of the same form as , and is some parameter (possibly vector-valued) such that yields the desired difference between and with some desired cluster variability.
This approach extends naturally to more complicated scenarios, including cases of two or more clustering factors, or even nested factors. The unifying theme is that data are simulated uniquely for each combination of clusters, mainly through parametric templates that are probabilistically generated for each combination. For example, in the case of hierarchical clustering, where one factor is nested within the levels of another, the parameters used to simulate the parameter values used to simulate data for each cluster can itself be probabilistically determined. Further, the researcher has much discretion in selecting how those factors or levels affect the particular probability distribution and parameters used to simulate the simulation template for each cluster. The dependence levels between the binary outcomes can also be made to be cluster- or leveldependent, provided a distribution is selected that offers control in selecting the desired dependence while also ensuring the proper support.
Distribution examples
We will consider three examples of distributions that can be used in this simulation process, understanding that there are alternative and potentially more suitable options available. The only requirement is that the support of the distribution must be equal to, be a proper subset of, or have a reasonably low probability of occurring outside. A simple choice would be to simulate the marginal probabilities from a uniform distribution such that for clusters and binary outcomes, where the midpoint of and yields the desired marginal mean. In this case the inter-cluster variability in marginal probabilities can be controlled by increasing or decreasing the difference, making it wider for greater variability and narrower for less variability. In this case the parameters can be selected such that.
Another example would be to simulate the marginal probabilities from a beta distribution such that for clusters and binary outcomes, where shape parameters and are selected so that the mode is equal to the desired marginal probability (i.e.). There are infinite pairings of the shape parameters that give the same mode, so the inter-cluster variability in the marginal probabilities is controlled by making both and larger (for less variability) or smaller (for more variability). Since the support of the distribution matches that of proportions and probabilities, we are assured that.
The final example we consider is to simulate marginal probabilities from a normal distribution with low variance such that, where is the desired marginal probability and is the desired inter-cluster variation. While this choice of distribution has infinite support, the variance can be made small enough to all but ensure values are greater than 0 and less than 1 while simultaneously providing the desired variability. The simulations can also be truncated so that if the simulated value is less than and if it is greater than. Using the normal distribution also has the advantage of providing more direct control of the inter-cluster variability in marginal proportions as compared to the anddistributions.
Extension to existing approaches
For the multivariate normal-based approach, the simulated marginal probabilities for clusters are matched with the desired dependence levels (or) and are used in Equation 1 separately for each cluster. At this point, the process continues as stated in Section 2.1. Likewise, for the multinomial-based approach, the simulated marginal probabilities are matched with the desired dependence levels and used to determine the joint pairwise probabilities in Equation 2. Thereafter, the multinomial approach continues as stated in Section 2.2.
Results and Discussion
Simulation study
Here the performance of the multivariate normal and multinomial approaches to simulating dependent binary data with random effects is examined through simulations studies. The first case illustrates the simple situation where we simulate dependent binary outcomes over clusters. A second case looks at the situation where we simulate dependent binary outcomes over clusters, each consisting of both treatment and control subjects. For each case we assume the correlation between the two outcomes is, irrespective of group and cluster. Sample size was fixed at subjects per cluster. For both cases we also investigate the use of the, and distributions for generating the simulation templates and incorporating the clusterlevel variability. A total of data sets was created for each combination of distribution and simulation method, and are used to estimate the average overall marginal probability for each measure, the standard deviation and distribution of those means, the average effect size (and standard deviation) for the case-specific hypothesis test, the empirical power for the case-specific hypothesis test, the mean and distribution of the inter-cluster variability, the mean estimated correlation, and the percentage of data sets for which the desired model converged. SAS (version 9.4, Cary, NC, USA) was used to simulate data and fit generalized linear mixed models using the IML and GLIMMIX procedures, respectively.
Case 1: Clustered, one-group, repeated-measure study
In this case we simulate binary outcomes over clusters, where the global marginal probabilities for the two outcomes are and, indicating that the rate of our simulated outcome increases by after some time (possibly after an intervention). To incorporate intercluster variance we simulate the marginal probabilities according to the specifications listed in Table 2. Here we see that: the midpoints of the two distributions are and, respectively; the modes of the two Beta distributions are and, respectively; and the means of the two normal distributions are and; in each case matching the target levels. These values also imply that the inter-cluster variability in the marginal means is for both measures with the distribution, and for the distribution, and for both measures with the Distribution. The intended model is fit with a fixed two-level “time” effect, a clusterlevel random effect to account for the inter-cluster variation, and a subject-level random effect to account for the correlation between the measures. The null hypothesis is no difference over time (i.e.) against a two-sided alternative.
Marginal
Distribution
Time
Mean
Uniform
Beta
Normal
1
2
Table 2: Simulation template for case one.
The aggregate results over the simulations for case 1 are found in Table 3. Here we see that both the MS and MVN simulation approaches were accurate in reproducing the marginal proportions and, as well as the difference. We see that the MS and MVN approaches everywhere provided similar estimates and standard errors. The variability of these estimates is low and is also comparable between approaches. The cluster random effects averaged over all simulations are also provided; note these will not necessarily correspond to the theoretical inter-cluster variances stated earlier as these are model-derived and based on linked expectations in the generalized linear mixed model framework. The target correlation () was also achieved by both methods, with reasonably small variance. The MS approach produced data sets that converged at least, while the MVN approach always converged. The empirical powers for testing the null hypothesis of no difference in change over time between the two groups for the both the multivariate normal and multinomial approaches were for each of the three distributions (not shown in Table 3).
Dist.
Approach
(SE)
(SE)
(SE)
Cluster Random
Effect (SE)
(SE)
% Converged
Uniform
MS
0.248
(0.015)
0.449
(0.015)
0.201
(0.023)
0.034
(0.020)
0.192
(0.021)
99.0%
MVN
0.250
(0.016)
0.449
(0.017)
0.199
(0.022)
0.033
(0.018)
0.191
(0.023)
100%
Beta
MS
0.257
(0.017)
0.453
(0.025)
0.196
(0.030)
0.075
(0.035)
0.181
(0.023)
99.8%
MVN
0.261
(0.019)
0.457
(0.026)
0.196
(0.031)
0.076
(0.033)
0.180
(0.022)
100%
Normal
MS
0.247
(0.026)
0.447
(0.024)
0.201
(0.034)
0.102
(0.044)
0.170
(0.025)
100%
MVN
0.247
(0.025)
0.450
(0.020)
0.203
(0.035)
0.100
(0.045)
0.170
(0.023)
100%
Table 3: Simulation results for case one.
While comparisons of the simulation results between the different simulating distributions (Uniform, Beta and Normal) are unnecessary, we can briefly investigate their behavior. The inter-cluster variability estimates using the normal distribution to generate the simulation template were for the MS approach and for the MVN approach. If these levels are deemed too large, then the variance assumed in the simulation template (here) can be lowered. Likewise, the intercluster variability can be increased or decreased using the Uniform distribution by either increasing or decreasing the range about the desired proportions. While the process for the Beta distribution requires solving one equation for two unknowns (such that the given scale and shape parameters provided a desired mode), their sum can be increased or decreased to either decrease or increase the desired intra-cluster variability.
Case 2: Clustered, two-group, repeated-measure study
In this case we simulate binary outcomes over clusters, where subjects in half the clusters belong to a treatment group and where subjects in the other half of the clusters belong to a control group. Assuming an effective treatment, the global marginal probabilities for the two outcomes in the treatment group are and, while for an ineffective control the global marginal probabilities are; these values indicate that the difference in the changes over “time” is. To incorporate inter-cluster variance we simulate the marginal probabilities according to the specifications listed in Table 4. As in the previous case, we can easily show that that each distribution obtains the target marginal probability for that Group and time. The intercluster variabilities are similar to what was described before. The intended model is fit with a fixed two-level “time” effect, a fixed twolevel “group” effect, a group-time interaction, a cluster-level random effect to account for the inter-cluster variation, and a subject-level random effect to account for the correlation between the measures. The null hypothesis is no difference in change over time between the two groups (i.e.) against a two-sided alternative.
Marginal
Distribution
Group
Time
Mean
Uniform
Beta
Normal
Treatment
1
2
Control
1
2
Table 4: Simulation template for case two.
The aggregate results over the simulations for case are found in Table 5. Here we see again that both approaches were effective in estimating the marginal means as well as the desired difference in the change in proportions over time (), and the efficiencies of these estimates were similar for both methods. The estimated intercluster variance and the correlation between the repeated measure outcomes were similar between the MS and MVN approaches, while the estimated correlations were also close to the desired level (). At least of the data sets generated by the MS approach allowed models to converge, and at least of the MVN-derived data sets allowed model convergence. The empirical powers for the testing the null hypothesis of no difference in change over time between the two groups for the multinomial approach were (Uniform), (Beta) and (Normal), and were (Uniform), (Beta) and (Normal) for the multivariate normal approach (not shown in Table 5).
Uniform
Beta
Normal
MS Approach
MVN Approach
MS Approach
MVN Approach
MS Approach
MVN Approach
0.247
(0.022)
0.246
(0.023)
0.260
(0.026)
0.259
(0.024)
0.248
(0.036)
0.246
(0.034)
0.450
(0.024)
0.452
(0.023)
0.452
(0.036)
0.452
(0.037)
0.449
(0.036)
0.450
(0.035)
0.248
(0.023)
0.247
(0.022)
0.259
(0.025)
0.260
(0.026)
0.243
(0.035)
0.244
(0.035)
0.249
(0.024)
0.248
(0.023)
0.261
(0.025)
0.258
(0.026)
0.245
(0.034)
0.246
(0.033)
0.203
(0.044)
0.204
(0.044)
0.190
(0.054)
0.195
(0.058)
0.199
(0.065)
0.200
(0.066)
Cluster R.E.
0.034
(0.020)
0.034
(0.020)
0.060
(0.032)
0.063
(0.030)
0.115
(0.055)
0.114
(0.057)
0.192
(0.023)
0.192
(0.023)
0.186
(0.025)
0.183
(0.024)
0.169
(0.026)
0.167
(0.025)
% Conv.
98.8%
100%
99.8%
100%
100%
100%
Table 5: Simulation results for case two.
Conclusion
We extended both the multinominal sampling approach and the multivariate normal approaches to simulating dependent binary data to account for desired random effect structures. The extensions for both methods are simple to implement and offer control of marginal probabilities, dependence between outcomes, and intracluster variability. Rather than being assigned constant values, the desired marginal probabilities are sampled from specified probability distributions, where a separate simulation template is simulated for each cluster. Simulation studies show that our extension to both approaches yields data that achieve the desired marginal probabilities with relatively low variability, and also exhibits the desired correlation between the binary repeated measures. The parameters for the distributions used in the simulation template can also be adjusted to achieve a desired inter-cluster variability.
One limitation in the presentation of this research is that the simulation templates used here are not exhaustive. In both examples offered we only considered cases of two repeated measures and we presented a limited selection of marginal means and correlations. However, extending this approach to account for more repeated measures or alternative simulation templates is straightforward. We also did not consider more complicated random effect structures, though the underlying principle remains the same: randomly generate a simulation template for each cluster or combination of clusters. This general idea can be applied to other simulation approaches for simulating dependent binary data (e.g., Qaqish [8]), and in principle can be adapted in simulation methodologies for other types of dependent outcomes.
An important statistical role in the preparation of clustered study designs is determining the sample size required to find a desired effect size. While equations or numerical procedures for estimating a required sample size are available for some situations (e.g., Donner, Birkett and Buck [9]), more complicated situations involving repeated measures and intricate clustering may require a simulationbased approach. Simulation templates can be designed to match the desired effect size and clustering structure, and empirical power can be estimated by repeatedly simulating such data. In a similar manner, new statistical methodologies suitable for repeated binary outcomes in clustered settings can be numerically assessed and compared with alternative procedures. Data can be simulated from a desired template and analyzed by the methodologies under consideration, and key features from that analysis (e.g., means, test statistics, confidence intervals, and hypothesis testing decisions) can be aggregated over repeated simulations and compared between competing models.
Acknowledgement
The authors are thankful for the kindness and guidance offered by Dr. Karl Peace in the conduct of this research and writing of this manuscript.
References
- Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. The American Statistician. 1991; 45: 302-304.
- Haynes ME, Sabo RT, Chaganty NR. Simulating dependent binary variables through multinomial sampling. Journal of Statistical Computation and Simulation. 2015.
- Sabo RT, Haynes ME, Chaganty NR. Using odds ratios to simulate dependent binary outcomes. Communications in Statistics: Simulation and Computation. 2015.
- Murray DM, Perry CL, Griffin G, Harty KC, Jacobs DR, Schmid L, et al. Results from a statewide approach to adolescent tobacco use prevention. Prev Med. 1992; 21: 449-472.
- Krist AH, Aycock RA, Etz RS, Devoe JE, Sabo RT, Williams R, et al. My Preventive Care: implementation and dissemination of an interactive preventive health record in three practice-based research networks serving disadvantaged patients. Implementation Science. 2014; 9: 181.
- Joe H. Multivariate Models and Dependence Concepts. London: Chapman and Hall. 1997.
- Chaganty NR, Joe H. Range of correlation matrices for dependent Bernoulli random variables. Biometrika. 2006; 93: 197-206.
- Qaqish BF. A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika. 2003; 90: 455-463.
- Donner A, Birkett N, Buck C. Randomization by cluster. Sample size requirements and analysis. Am J Epidemiol. 1981; 114: 906-914.