R/add_correlated_data.R
addCorGen.Rd
Create multivariate (correlated) data - for general distributions
addCorGen(
  dtOld,
  nvars = NULL,
  idvar = "id",
  rho = NULL,
  corstr = NULL,
  corMatrix = NULL,
  dist,
  param1,
  param2 = NULL,
  cnames = NULL,
  method = "copula",
  ...
)
dtOld
The data set that will be augmented. If the data set includes a single record per id, the new data table will be created as a "wide" data set. If the original data set includes multiple records per id, the new data set will be in "long" format.
nvars
The number of new variables to create for each id. This is only applicable when the data are generated from a data set that includes one record per id.
idvar
String: the name of the column that holds the individual-level id for the correlated data.
rho
Correlation coefficient, -1 <= rho <= 1. Use if corMatrix is not provided.
corstr
Correlation structure of the correlation matrix defined by rho. Options include "cs" for a compound symmetry structure and "ar1" for an autoregressive structure.
corMatrix
A correlation matrix that is entered directly. It must be symmetric and positive semi-definite. It is not required; if a matrix is not provided, then a structure (corstr) and a correlation coefficient (rho) must be specified.
dist
A string indicating the distribution: "normal", "binary", "poisson", or "gamma".
param1
A string that names the column in dtOld that contains the parameter for the mean of the distribution. In the case of the uniform distribution, the column specifies the minimum.
param2
A string that names the column in dtOld that contains a possible second parameter for the distribution. For the normal distribution, this will be the variance; for the gamma distribution, this will be the dispersion; and for the uniform distribution, this will be the maximum.
cnames
Explicit column names. A single string with names separated by commas. If no string is provided, the default names will be V#, where # represents the column.
method
Two methods are available to generate correlated data. (1) "copula" uses the multivariate Gaussian copula method, which can be applied to all available distributions. (2) "ep" uses an algorithm developed by Emrich and Piedmonte (1991) for generating correlated binary data.
...
May include additional arguments that have been deprecated and are no longer used.
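The "copula" method described above can be illustrated with a short base R sketch. This shows only the general Gaussian copula idea and is not the code that addCorGen runs internally. The names n, lambda, R, z, u, and x are hypothetical, and a constant Poisson mean is assumed for simplicity.

```r
# Sketch only: Gaussian copula for correlated Poisson counts (base R).
# All object names below are hypothetical and not part of the package.
set.seed(123)

n <- 1000               # number of ids
nvars <- 3              # correlated variables per id
rho <- 0.7              # correlation on the latent normal scale
lambda <- rep(2, n)     # Poisson mean for each id (assumed constant here)

# Compound symmetry correlation matrix: 1 on the diagonal, rho elsewhere
R <- matrix(rho, nrow = nvars, ncol = nvars)
diag(R) <- 1

# Step 1: correlated standard normals; chol(R) is upper triangular with
# t(U) %*% U = R, so the rows of e %*% U have correlation matrix R
e <- matrix(rnorm(n * nvars), nrow = n)
z <- e %*% chol(R)

# Step 2: transform each margin to Uniform(0, 1)
u <- pnorm(z)

# Step 3: apply the Poisson quantile function, using each id's own mean
x <- apply(u, 2, function(col) qpois(col, lambda))
colnames(x) <- paste0("V", seq_len(nvars))

round(cor(x), 2)
```

Because the Poisson margins are discrete, the correlation of the generated counts will generally not equal rho exactly.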
Original data.table with added column(s) of correlated data
The original data table can come in one of two formats: a single row per idvar (where data are ungrouped) or multiple rows per idvar (in which case the data are grouped or clustered). The structure of the arguments depends on the format of the data.
In the case of ungrouped data, there are two ways to specify the number of correlated variables and the correlation matrix. In approach (1), nvars must be specified along with rho and corstr. In approach (2), corMatrix is specified as a single square n x n correlation matrix, and the number of new variables generated for each record will be n. If nvars, rho, corstr, and corMatrix are all specified, the data will be generated based on corMatrix alone. In both (1) and (2), the data will be returned in wide format.
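For reference, the two structures named by corstr are conventionally defined as follows, shown here for nvars = 3. These are the standard textbook definitions of compound symmetry and first-order autoregressive correlation, and it is an assumption that they apply here unchanged.

\[
R_{\text{cs}} = \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix},
\qquad
R_{\text{ar1}} = \begin{pmatrix} 1 & \rho & \rho^{2} \\ \rho & 1 & \rho \\ \rho^{2} & \rho & 1 \end{pmatrix}
\]

In general, the off-diagonal entry in row j and column k is \(\rho\) under "cs" and \(\rho^{|j-k|}\) under "ar1".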
In the case of grouped data, where there are G groups, there are also two ways to proceed. In both cases, the number of new variables to be generated may vary by group and is determined by the number of records in each group, \(n_i, i \in \{1,...,G\}\) (i.e., the number of records that share the same value of idvar). nvars is not used with grouped data. In approach (1), the arguments rho and corstr are both specified to determine the structure of the correlation matrix. In approach (2), the argument corMatrix is specified. corMatrix can be a single matrix with dimensions \(n \times n\) if \(n_i = n\) for all i. However, if the group sizes vary (i.e., \(n_i \ne n_j\) for some groups i and j), corMatrix must be a list of correlation matrices of length G; the matrix for group i has dimensions \(n_i \times n_i, \ i \in \{1,...,G\}\). With grouped data, the new data will be returned in long format (i.e., as a single new column).
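When group sizes differ, the list form of corMatrix needs one matrix per group, each sized by that group's record count. The base R sketch below shows one way such a list could be assembled by hand. The cid vector, the cs_matrix helper, and corMatList are hypothetical names used only for illustration.

```r
# Sketch only: one compound symmetry matrix per group, sized by group.
# cid, cs_matrix, and corMatList are hypothetical names.
cid <- c(1, 2, 2, 2, 3, 3)    # stand-in for the idvar column
rho <- 0.5

n_i <- table(cid)             # records per group: 1, 3, 2

# Build a k x k compound symmetry correlation matrix
cs_matrix <- function(k, rho) {
  m <- matrix(rho, nrow = k, ncol = k)
  diag(m) <- 1
  m
}

# A list of length G whose i-th element is n_i x n_i
corMatList <- lapply(as.integer(n_i), cs_matrix, rho = rho)

length(corMatList)
sapply(corMatList, nrow)
```

The list has one element per distinct cid value, in the sorted order that table() returns.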
Emrich LJ, Piedmonte MR. A Method for Generating High-Dimensional Multivariate Binary Variates. The American Statistician 1991;45:302-4.
# Ungrouped data
cMat <- genCorMat(nvars = 4, rho = .2, corstr = "ar1", nclusters = 1)
def <-
  defData(varname = "xbase", formula = 5, variance = .4, dist = "gamma") |>
  defData(varname = "lambda", formula = ".5 + .1*xbase", dist = "nonrandom", link = "log") |>
  defData(varname = "n", formula = 3, dist = "noZeroPoisson")
dd <- genData(101, def, id = "cid")
## Specify with nvars, rho, and corstr
addCorGen(
  dtOld = dd, idvar = "cid", nvars = 3, rho = .7, corstr = "cs",
  dist = "poisson", param1 = "lambda"
)
#> Key: <cid>
#>        cid     xbase   lambda     n    V1    V2    V3
#>      <int>     <num>    <num> <num> <num> <num> <num>
#>   1:     1  2.464257 2.109447     1     4     6     5
#>   2:     2  8.193663 3.741050     1     2     2     4
#>   3:     3  1.616013 1.937893     3     2     3     2
#>   4:     4  3.972252 2.452788     3     6     3     4
#>   5:     5 10.439773 4.683179     2     2     4     2
#>  ---
#>  97:    97  1.172740 1.853868     5     2     2     2
#>  98:    98  3.251514 2.282226     3     1     0     2
#>  99:    99  7.473232 3.481012     3     1     3     3
#> 100:   100  2.156320 2.045479     2     6     3     3
#> 101:   101  3.476790 2.334223     4     0     1     0
## Specify with corMatrix
addCorGen(
  dtOld = dd, idvar = "cid", corMatrix = cMat,
  dist = "poisson", param1 = "lambda"
)
#> Key: <cid>
#>        cid     xbase   lambda     n    V1    V2    V3    V4
#>      <int>     <num>    <num> <num> <num> <num> <num> <num>
#>   1:     1  2.464257 2.109447     1     2     2     2     0
#>   2:     2  8.193663 3.741050     1    12     3     4     2
#>   3:     3  1.616013 1.937893     3     2     1     2     0
#>   4:     4  3.972252 2.452788     3     0     0     1     5
#>   5:     5 10.439773 4.683179     2     1     5     5     5
#>  ---
#>  97:    97  1.172740 1.853868     5     3     3     2     0
#>  98:    98  3.251514 2.282226     3     3     3     4     4
#>  99:    99  7.473232 3.481012     3     2     1     5     3
#> 100:   100  2.156320 2.045479     2     3     1     3     3
#> 101:   101  3.476790 2.334223     4     2     2     1     2
# Grouped data
cMats <- genCorMat(nvars = dd$n, rho = .5, corstr = "cs", nclusters = nrow(dd))
dx <- genCluster(dd, "cid", "n", "id")
## Specify with rho and corstr (nvars is not used with grouped data)
addCorGen(
  dtOld = dx, idvar = "cid", rho = .8, corstr = "ar1",
  dist = "poisson", param1 = "xbase"
)
#> Key: <cid>
#>        cid    xbase   lambda     n    id     X
#>      <int>    <num>    <num> <num> <int> <num>
#>   1:     1 2.464257 2.109447     1     1     2
#>   2:     2 8.193663 3.741050     1     2     6
#>   3:     3 1.616013 1.937893     3     3     1
#>   4:     3 1.616013 1.937893     3     4     1
#>   5:     3 1.616013 1.937893     3     5     1
#>  ---
#> 299:   100 2.156320 2.045479     2   299     3
#> 300:   101 3.476790 2.334223     4   300     7
#> 301:   101 3.476790 2.334223     4   301     7
#> 302:   101 3.476790 2.334223     4   302     8
#> 303:   101 3.476790 2.334223     4   303     6
## Specify with corMatrix
addCorGen(
  dtOld = dx, idvar = "cid", corMatrix = cMats,
  dist = "poisson", param1 = "xbase"
)
#> Key: <cid>
#>        cid    xbase   lambda     n    id     X
#>      <int>    <num>    <num> <num> <int> <num>
#>   1:     1 2.464257 2.109447     1     1     5
#>   2:     2 8.193663 3.741050     1     2    11
#>   3:     3 1.616013 1.937893     3     3     0
#>   4:     3 1.616013 1.937893     3     4     0
#>   5:     3 1.616013 1.937893     3     5     2
#>  ---
#> 299:   100 2.156320 2.045479     2   299     4
#> 300:   101 3.476790 2.334223     4   300     6
#> 301:   101 3.476790 2.334223     4   301     3
#> 302:   101 3.476790 2.334223     4   302     3
#> 303:   101 3.476790 2.334223     4   303     3