The whSample Package

whSample helps analysts quickly generate statistical samples from Excel or Comma Separated Value (CSV) files and write them to a new Excel workbook. Users have a choice of Simple Random or Stratified Random samples, and a third choice of having each stratum included in a separate worksheet.

See package vignettes for detailed documentation.

ssize

The workhorse function is sampler. A helper function, ssize, estimates the minimum sample size necessary to achieve statistical requirements using a Normal Approximation to the Hypergeometric Distribution. This distribution spans the probabilities of yes/no-type responses without replacement. These parameters are:

• N, the population size.
• ci, the required confidence interval. The default is 95%.
• me, the required level of precision, or margin of error. The default is +/- 7%.
• p, the anticipated rate of occurrence. The default is 50%.

ssize(N, ci=0.95, me=0.07, p=0.50) (showing the defaults) only requires the N argument. Used as a standalone, it can be used to explore sample sizes under other conditions. For example, a probe sample may suggest that a 50-50 probability isn’t realistic. A revised sample size can be estimated with the observed success probability (p=0.6, for example).

sampler

The sampler function calls ssize to get its sample size estimate. Therefore, it requires the ci, me, and p arguments, which it passes to ssize.

sampler also takes four additional arguments:

• irisData opens the file chooser to a folder with example files of Anderston’s Iris dataset of flower characteristics.
• backups provides a buffer for use if necessary to replace samples found to be invalid for some reason,
• seed is used to seed the internal random number generator, and
• keepOrg determines if a copy of the population is included in the output.

The defaults for these additional arguments are backups=5, irisData=F, seed=NULL and keepOrg=F. The default seed will tell sampler to use the current system time in milliseconds. Any number can be used as a seed. Whichever one is used will be listed in the Report output tab. The keep-original option (keepOrg) defaults to FALSE, but could be set to keepOrg=T for smaller populations that wouldn’t exceed Excel’s row limit is 1,048,576 rows.

To override any of these defaults, enter name=value as an argument.

sampler uses a series of menus to guide users through the sampling process.

Output

sampler creates a new Excel workbook with three parts:

• a copy of the original (source) data if previously requested,

• an Excel spreadsheet with the requested sample, and

• a new tab called Report with key reference information:

• path and name of the source file

• size (in rows) of the source file

• sample type (Simple Random Sample, Stratified Random Sample, or Tabbed Stratified Sample)

• sampling parameters

• sample size

• stratification key

• number of strata

• number of backups requested (this number is applied to every stratum in a stratified sample)

• random number seed used, for documentation and reproducibility

• date-time stamp of when the sample was generated

• stratification information (name, number in the population, proportion of the population, and the number of samples)

Installation

You can install whSample from CRAN with:

``install.packages("whSample")``

or get the latest developmental version with:

``devtools::install_github("km4ivi/whSample")``

Other necessary packages

sampler depends on several external packages to run properly. If you’re running a developmental version, make sure these packages are installed on your computer:

• tidyverse (or individually: magrittr, dplyr, purrr)
• openxlsx
• data.table
• tools
• utils
• tcltk
• bit64

Examples

ssize(5000): N=5000, other arguments use defaults

ssize(5000, p=0.60): N=5000, with a 60% expected rate of occurrence

sampler(): Uses all defaults, gets N from the source data.

sampler(backups=2, seed=12345): Overrides specific defaults