Is there a way to create a fake dataset that fits the following parameters: n, mean, sd, min, and max?
I want to create a sample of 187 integer scale scores that have a mean of 67, a standard deviation of 17, and all observations within the range [30, 210]. I'm trying to demonstrate a conceptual lesson about statistical power, and to create a data distribution that looks like a published result. The scale score in my example is the sum of 30 items that each range from 1 to 7. I don't need the data for the individual items that make up the scale score, but that would be a bonus.
I know I can use rnorm(), but then the values are not integers, and the min and max can exceed the possible values.
scalescore <- rnorm(187, mean = 67, sd = 17)
I know I can use sample(), so that the integers stay within range, but then the mean and standard deviation won't be right.
scalescore <- sample(30:210, 187, replace = TRUE)
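One way to split the difference between these two (a sketch, not an exact solution): standardize the rnorm() draws so the continuous values match the target mean and sd exactly, then round and clamp. The rounding and clamping perturb both moments slightly, which illustrates why hitting them exactly needs a different tool:

```r
# Standardize, then rescale: the continuous values match the target
# mean and sd exactly before rounding.
set.seed(1)
x <- rnorm(187)
x <- (x - mean(x)) / sd(x)   # mean 0, sd 1 (up to floating point)
x <- x * 17 + 67             # mean 67, sd 17
# Rounding to integers and clamping to [30, 210] perturbs both slightly:
scalescore <- pmin(pmax(round(x), 30), 210)
mean(scalescore)  # close to, but not exactly, 67
sd(scalescore)    # close to, but not exactly, 17
```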
@Pascal's tip led me to urnorm() in the Runuran package:
set.seed(5)
scalescore <- urnorm(n = 187, mean = 67, sd = 17, lb = 30, ub = 210)
mean(scalescore)
# [1] 68.51758
sd(scalescore)
# [1] 16.38056
min(scalescore)
# [1] 32.15726
max(scalescore)
# [1] 107.6758
The mean and sd are not exact, of course, and the vector does not consist of integers.
Any other options?
Integer optimization without a template
Since you want an exact mean, standard deviation, min, and max, my first choice wouldn't be random number generation, since a random sample is unlikely to exactly match the mean and standard deviation of the distribution you're drawing from. Instead, I would take an integer optimization approach. Define a variable x_i to be the number of times the integer i appears in the sample. You'll define decision variables x_30, x_31, ..., x_210, and add constraints to ensure your conditions are met:
- 187 samples: this can be encoded with the constraint x_30 + x_31 + ... + x_210 = 187
- Mean of 67: this can be encoded with the constraint 30*x_30 + 31*x_31 + ... + 210*x_210 = 187 * 67
- Logical constraints on the variables: all variables must take non-negative integer values
- "Looks like real data" is an ill-defined concept, but one option is to require that the frequencies of adjacent numbers differ by no more than 1, using linear constraints of the form x_30 - x_31 <= 1 and x_30 - x_31 >= -1, and so on for every consecutive pair. You can also require that each frequency not exceed an arbitrarily defined upper bound (I'll use 10).
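Those adjacent-pair constraints can all be generated at once with a difference matrix D that has 1 on the diagonal and -1 on the superdiagonal, so that (D %*% x)[i] = x_i - x_{i+1}; this is what the `diag(nv) - cbind(rep(0, nv), diag(nv)[, -nv])` expression in the solver code constructs. A small sketch of the idea:

```r
# D has 1 on the diagonal and -1 on the superdiagonal, so
# (D %*% x)[i] = x_i - x_{i+1} (the last row is just x_nv).
nv <- 5
D <- diag(nv) - cbind(rep(0, nv), diag(nv)[, -nv])
x <- c(3, 5, 4, 4, 2)
as.vector(D %*% x)  # -2  1  0  2  2
```

Bounding D %*% x between -1 and 1 (for all but the last row) then enforces every adjacent-frequency constraint in one shot.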
Finally, you want the standard deviation to be as close to 17 as possible, meaning you want the variance to be as close as possible to 17^2 = 289. You can define a variable y to be an upper bound on how closely you match the variance, and minimize y:

y >= ((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) - (289 * (187-1))
y >= -((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) + (289 * (187-1))
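To see why minimizing y pins down the standard deviation: when the mean constraint holds exactly, the weighted sum of squared deviations equals (n - 1) times the sample variance. A quick check of that identity, using a made-up toy sample whose mean is exactly 67:

```r
# Toy sample with mean exactly 67 (hypothetical values for illustration).
samp <- c(30, 60, 67, 74, 104)
n <- length(samp)
counts <- table(samp)                  # the x_i frequencies
vals <- as.numeric(names(counts))
# sum_i (i - 67)^2 * x_i equals (n - 1) * var(samp) when mean(samp) == 67,
# so forcing that sum toward 289 * (n - 1) forces the variance toward 17^2.
sum((vals - 67)^2 * counts) == (n - 1) * var(samp)  # TRUE
```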
This is a pretty easy optimization problem to solve with a solver like lpSolve:
library(lpSolve)

get.sample <- function(n, avg, stdev, lb, ub) {
  vals <- lb:ub
  nv <- length(vals)
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),
            const.mat = rbind(c(rep(1, nv), 0),
                              c(vals, 0),
                              c(-(vals - avg)^2, 1),
                              c((vals - avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),
                              cbind(diag(nv) - cbind(rep(0, nv), diag(nv)[, -nv]), rep(0, nv)),
                              cbind(diag(nv) - cbind(rep(0, nv), diag(nv)[, -nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg * n, -stdev^2 * (n - 1), stdev^2 * (n - 1),
                          rep(10, nv), rep(1, nv), rep(-1, nv)),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}

samp <- get.sample(187, 67, 17, 30, 210)
summary(samp)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    30      64      69      67      74     119
sd(samp)
# [1] 17
plot(table(samp))
For the parameters you provided, I was able to get the exact mean and standard deviation while returning integer values, and the computation completed on my computer in 0.4 seconds.
Integer optimization with a template
Another approach to getting something that resembles "real data" is to define a starting continuous distribution (e.g. the result of the urnorm function you include in your original post) and to round its values to integers in the way that best achieves the mean and standard deviation objectives. This introduces two new classes of constraints: an upper bound on the number of samples at each value (the number of continuous samples that could be rounded either up or down to that value) and a lower bound on the sum of each pair of consecutive frequencies (the number of continuous samples that fall between those two integers). Again, this is easy to implement with lpSolve and not terribly inefficient to run:
library(lpSolve)

get.sample2 <- function(n, avg, stdev, lb, ub, init.dist) {
  vals <- lb:ub
  nv <- length(vals)
  lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
  floors <- as.vector(table(factor(floor(init.dist), vals)))
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),
            const.mat = rbind(c(rep(1, nv), 0),
                              c(vals, 0),
                              c(-(vals - avg)^2, 1),
                              c((vals - avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),
                              cbind(diag(nv) + cbind(rep(0, nv), diag(nv)[, -nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg * n, -stdev^2 * (n - 1), stdev^2 * (n - 1), lims, floors),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}

library(Runuran)
set.seed(5)
init.dist <- urnorm(n = 187, mean = 67, sd = 17, lb = 30, ub = 210)
samp2 <- get.sample2(187, 67, 17, 30, 210, init.dist)
summary(samp2)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    32      57      66      67      77     107
sd(samp2)
# [1] 17
plot(table(samp2))
This approach is faster (under 0.1 seconds) and still returns a distribution that meets the required mean and standard deviation. Further, given sufficiently high-quality samples from a continuous distribution, it can be used for distributions of different shapes to obtain integer values that meet the required statistical properties.