R - Create a fake dataset that fits the following parameters: N, mean, sd, min, and max


Is there a way to create a fake dataset that fits the following parameters: n, mean, sd, min, and max?

I want to create a sample of 187 integer scale scores that has a mean of 67, a standard deviation of 17, and all observations within the range [30, 210]. I'm trying to demonstrate a conceptual lesson about statistical power, and I want to create data whose distribution looks like a published result. The scale score in my example is the sum of 30 items that each range from 1 to 7. I don't need data for the individual items that make up the scale score, but that would be a bonus.

I know I can use rnorm(), but the values are not integers, and the min and max can exceed the possible values.

scalescore <- rnorm(187, mean = 67, sd = 17) 

I know I can use sample() so that the integers stay within range, but then the mean and standard deviation won't be right.

scalescore <- sample(30:210, 187, replace = TRUE)

@Pascal's tip led me to urnorm() in the Runuran package:

set.seed(5)
scalescore <- urnorm(n = 187, mean = 67, sd = 17, lb = 30, ub = 210)
mean(scalescore)
# [1] 68.51758
sd(scalescore)
# [1] 16.38056
min(scalescore)
# [1] 32.15726
max(scalescore)
# [1] 107.6758

The mean and sd are not exact, of course, and the vector does not consist of integers.
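One obvious follow-up is to round the truncated-normal draws to get integers. Below is a base-R sketch (rejection sampling from rnorm() stands in for urnorm(), so the package isn't required); it shows that rounding fixes the integer problem but still leaves the mean and sd only approximate, which motivates the optimization answers below.

```r
# Base-R stand-in for a truncated normal: draw from N(67, 17) and keep
# only in-range values, then round to integers.
set.seed(5)
draws <- numeric(0)
while (length(draws) < 187) {
  x <- rnorm(187, mean = 67, sd = 17)
  draws <- c(draws, x[x >= 30 & x <= 210])  # reject out-of-range draws
}
scalescore <- round(draws[1:187])  # integers now, but mean/sd still drift
mean(scalescore)  # near 67, not exactly 67
sd(scalescore)    # near 17, not exactly 17
```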

Any other options?

Integer optimization, no template

Since you want an exact mean, standard deviation, min, and max, my first choice wouldn't be random number generation, since a sample is unlikely to exactly match the mean and standard deviation of the distribution you're drawing from. Instead, I would take an integer optimization approach. Define a variable x_i to be the number of times integer i appears in your sample. You'll define decision variables x_30, x_31, ..., x_210 and add constraints to ensure your conditions are met:

  • 187 samples: this can be encoded with the constraint x_30 + x_31 + ... + x_210 = 187
  • Mean of 67: this can be encoded with the constraint 30*x_30 + 31*x_31 + ... + 210*x_210 = 187 * 67
  • Logical constraints on the variables: all variables must take non-negative integer values
  • Since "looks like real data" is an ill-defined concept, I require that the frequencies of adjacent numbers differ by no more than 1. These are linear constraints of the form x_30 - x_31 <= 1 and x_30 - x_31 >= -1, and so on for every consecutive pair. I also require that each frequency not exceed an arbitrarily defined upper bound (I'll use 10).

Finally, you want the standard deviation to be as close to 17 as possible, meaning you want the variance to be as close as possible to 17^2 = 289. You can define a variable y to be an upper bound on how closely you match the variance, and you can minimize y:

y >= ((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) - (289 * (187-1))
y >= -((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) + (289 * (187-1))
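This pair of inequalities is the standard linearization of an absolute value: at the optimum, y equals |Q - 289*(187-1)|, where Q is the sum of squared deviations from the mean. A small check with a hypothetical sample (the three values and their frequencies below are made up, chosen so the mean comes out to exactly 67):

```r
n <- 187; avg <- 67
target <- 17^2 * (n - 1)  # 289 * 186, the sum of squared deviations we want
# Hypothetical sample: 60 copies of 50, 67 copies of 67, 60 copies of 84.
samp <- rep(c(50, 67, 84), times = c(60, 67, 60))
Q <- sum((samp - avg)^2)  # sum of squared deviations from the mean
y <- abs(Q - target)      # smallest y satisfying both inequalities
```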

This is a pretty easy optimization problem to solve with a solver like lpSolve:

library(lpSolve)
get.sample <- function(n, avg, stdev, lb, ub) {
  vals <- lb:ub
  nv <- length(vals)
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),
            const.mat = rbind(c(rep(1, nv), 0),
                              c(vals, 0),
                              c(-(vals-avg)^2, 1),
                              c((vals-avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),
                              cbind(diag(nv) - cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv)),
                              cbind(diag(nv) - cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), rep(10, nv), rep(1, nv), rep(-1, nv)),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}
samp <- get.sample(187, 67, 17, 30, 210)
summary(samp)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      30      64      69      67      74     119 
sd(samp)
# [1] 17
plot(table(samp))


For the parameters you provided, I was able to get the exact mean and standard deviation while returning only integer values, and the computation completed on my computer in 0.4 seconds.
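As a sanity check, the "looks like real data" constraints can be verified directly on any candidate frequency vector; the counts below are hypothetical, standing in for the solver's x_30, ..., x_210:

```r
# Hypothetical frequencies for consecutive values: the smoothness constraints
# require adjacent counts to differ by at most 1, and the cap constraint
# requires no count to exceed the arbitrary bound of 10.
x <- c(3, 4, 4, 5, 4, 3)
all(abs(diff(x)) <= 1)  # adjacent-difference constraint holds
all(x <= 10)            # frequency-cap constraint holds
```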

Integer optimization with a template

Another approach to getting something that resembles "real data" is to define a starting continuous distribution (e.g. the result of the urnorm function you included in your original post) and to round its values to integers in the way that best achieves the mean and standard deviation objectives. This introduces two new classes of constraints: an upper bound on the number of samples at each value (the number of continuous samples that round either up or down to that value) and a lower bound on the sum of each pair of consecutive frequencies (the number of continuous samples that fall between those two integers). This is again easy to implement with lpSolve and is not terribly inefficient to run:

library(lpSolve)
get.sample2 <- function(n, avg, stdev, lb, ub, init.dist) {
  vals <- lb:ub
  nv <- length(vals)
  lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
  floors <- as.vector(table(factor(floor(init.dist), vals)))
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),
            const.mat = rbind(c(rep(1, nv), 0),
                              c(vals, 0),
                              c(-(vals-avg)^2, 1),
                              c((vals-avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),
                              cbind(diag(nv) + cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), lims, floors),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}

library(Runuran)
set.seed(5)
init.dist <- urnorm(n = 187, mean = 67, sd = 17, lb = 30, ub = 210)
samp2 <- get.sample2(187, 67, 17, 30, 210, init.dist)
summary(samp2)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      32      57      66      67      77     107 
sd(samp2)
# [1] 17
plot(table(samp2))
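To see how the lims and floors bounds inside get.sample2 are built, here is a toy illustration with four hypothetical continuous draws:

```r
# Each continuous sample s can round down to floor(s) or up to ceiling(s).
# So integer v can receive at most #{floor(s) == v} + #{ceiling(s) == v}
# samples (lims), and every sample with floor(s) == v must land on v or
# v + 1, giving a lower bound on each pair of consecutive counts (floors).
init.dist <- c(66.2, 66.9, 67.5, 68.1)  # hypothetical draws
vals <- 66:69
lims   <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
floors <- as.vector(table(factor(floor(init.dist), vals)))
lims    # per-value upper bounds on the counts
floors  # lower bounds on x_v + x_{v+1}
```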


This approach is even faster (under 0.1 seconds) and still returns a distribution that meets the required mean and standard deviation. Further, given sufficiently high quality samples from continuous distributions, it can be used for distributions of different shapes that take integer values and meet the required statistical properties.

