Why sampling?

  • In most cases a complete enumeration is not possible, but if the sampling is done properly, properties of a population can be inferred from a randomly selected sample

General approach

  • Overall planning of the survey
  • Definition of the population of interest
  • Construction of the sampling frame
  • Choice of sampling design
  • Selection of the sample
  • Data collection
  • Data editing and imputation
  • Estimation
  • Dissemination

Sampling error

  • The fact that not all units enter the inference introduces uncertainty about the estimates; this uncertainty is called the sampling error
  • The sampling error is a consequence of the fact that we would get (slightly) different results if we could repeatedly draw new samples
  • Given a sampling design, the sampling error will generally be lower if the units in the population are alike (low variance) or if the sample size is larger
  • Note: For the same sample size the sampling error will be smaller if a suitable design is chosen

Simple demonstration of sampling error

# synthetic population
set.seed(236542)
N <- 1000
yi <- rnorm(N, mean=5000, sd=1000)

# attributes at population level
mean(yi)
## [1] 4965.94

hist(yi)

# take a sample of size n
n <- 10
s <- sample(1:length(yi), n, replace = FALSE)

# mean of sample
mean(yi[s])
## [1] 4729.023
mean(yi)
## [1] 4965.94

# repeat sampling K times
K <- 1000
mu_hat <- numeric(K)

for (i in 1:K){
  s <- sample(1:length(yi), n, replace = FALSE)
  mu_hat[i] <- mean(yi[s])
}

mean(mu_hat)  
## [1] 4964.755

# sampling distribution
hist(mu_hat)  

  • Playing with population variance and sample size…
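  • A minimal sketch of this (the helper sim_se and the parameter values are ours, chosen for illustration): the empirical standard error of the sample mean shrinks when the population variance is lower or the sample size is larger:
sim_se <- function(pop_sd, n, N = 1000, K = 1000) {
  # synthetic population with the given spread
  yi <- rnorm(N, mean = 5000, sd = pop_sd)
  # empirical standard error of the sample mean over K repeated SRS draws
  mu_hat <- replicate(K, mean(yi[sample(seq_along(yi), n, replace = FALSE)]))
  sd(mu_hat)
}

set.seed(236542)
sim_se(pop_sd = 1000, n = 10)   # baseline
sim_se(pop_sd =  200, n = 10)   # lower population variance
sim_se(pop_sd = 1000, n = 100)  # larger sample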

Representativity

  • Representativity is not dependent upon the sample size
  • For a sample to be representative, all units must have the possibility to be in the sample (inclusion probability > 0)
  • The intuitive meaning of a representative sample is normally that it has a certain size, but also that the marginal properties of the sample resemble those of the population (a small sketch follows after this list)
    • Certain procedures exist that construct a balanced sample in exactly this sense
    • A balanced sample is also said to be well spread, and measures of spread have been constructed (see e.g. Grafström and Tillé)
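  • A small base-R sketch of the point on marginal properties (the region variable and all values are invented for illustration):
set.seed(1)
N <- 1000
region <- sample(c("North", "South"), N, replace = TRUE, prob = c(0.3, 0.7))

n <- 100
s <- sample(1:N, n, replace = FALSE)

prop.table(table(region))     # population marginal
prop.table(table(region[s]))  # sample marginal, roughly similar for an SRS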

SRS: Simple random sampling

  • SRS is one of the fundamental building blocks
  • Assume a population of size \(N\) and that a sampling frame (a list of units) is available
  • Randomly select \(n\) out of \(N\) units without replacement
  • The implied inclusion probability is \(\pi = n/N\)
  • Note: This is equivalent to considering the \(C(N,n)\) possible \(n\)-subsets and randomly selecting one of these as the sample!
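  • A minimal sketch of SRS with base R (the values of N and n are just for illustration):
N <- 1000
n <- 10
s <- sample(1:N, n, replace = FALSE)  # draw an SRS of n out of N units

n / N         # implied inclusion probability pi for every unit
choose(N, n)  # number of possible n-subsets C(N, n)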

SRS and the HT-estimator

  • Assume a population with \(N\) units and that the units have some quantitative attribute \(y\)
  • We seek to estimate the total sum of this attribute over the entire population: \[ \tau_y = \sum _{j=1}^N y_j\]
  • For this purpose we draw a sample of size \(n\) by simple random sampling (SRS)
  • The inclusion probability \(\pi\) for each unit is \(n/N\)

  • We can now estimate \(\tau_y\) by applying the Horvitz-Thompson estimator
  • Every single observation from the sample is weighted by the inverse of the inclusion probability: \[\hat{\tau}_y = \sum_{j=1}^n \frac{N}{n}y_j\]

  • The variance of \(\hat{\tau}_y\) is given by \[ V(\hat{\tau}_y) = N^2 (1-\frac{n}{N}) \frac{\hat{s}^2}{n} \] where \(\hat{s}^2\) is an estimate for the variance of \(y\) in the population \(V(y) = \sigma^2\)
  • The factor \((1-n/N)\) is the finite population correction — this quantity can be ignored when the sampling fraction \(n/N\) is very small
  • We normally estimate \(\hat{s}\) from the sample by the sample standard deviation \[ \hat{s} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2} \]

  • Normally we are more interested in the square root of the variance of the estimate — this quantity is called the standard error
  • The standard error is conveniently measured on the same scale as the estimate
  • We can also calculate the coefficient of variation (or relative standard error) \[ CV(\hat{\tau}) = \frac{s_{\hat{\tau}}}{\hat{\tau}} \]

  • Note: If instead we wanted to estimate the mean \(\mu\) of \(y\) in the population, then we simply divide by the population size \(N\), which yields \[ \hat{\mu}_y = \frac{1}{N} \hat{\tau}_y = \sum_{j=1}^n \frac{1}{n}y_j \] This is simply the sample mean
  • Consequently the variance for the estimated mean can be found as \[ V(\hat{\mu}_y) = V(\frac{1}{N}\hat{\tau}_y) = \frac{1}{N^2}V(\hat{\tau}_y) = (1-\frac{n}{N}) \frac{\hat{s}^2}{n} \]
  • Note that \(CV(\hat{\tau}) = CV(\hat{\mu}_y)\)
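  • As a sketch, the formulas above can be computed directly in R for an SRS drawn from the synthetic population used earlier (the variable names are ours, for illustration):
set.seed(236542)
N  <- 1000
yi <- rnorm(N, mean = 5000, sd = 1000)

n  <- 10
s  <- sample(1:N, n, replace = FALSE)
ys <- yi[s]

tau_hat <- N / n * sum(ys)                  # HT estimate of the total
v_hat   <- N^2 * (1 - n / N) * var(ys) / n  # estimated variance
se_hat  <- sqrt(v_hat)                      # standard error
cv_hat  <- se_hat / tau_hat                 # coefficient of variation

c(total = tau_hat, SE = se_hat, CV = cv_hat)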

Application of the HT-estimator in R

  • Example from Cochran (pp. 27-28)

  • We start by entering the data, which can be done in any number of ways
yi <- c(rep(42,23), rep(41,4), 36, 32, 29, rep(27,2), 23, 19, rep(16,2), 
        rep(15,2), 14, 11, 10, 9, 7, rep(6,3), rep(5,2), 4, 3)
  • We also need to declare \(N\) and \(n\): the former is hard-coded, while the latter is derived from the \(y_i\)’s
N <- 676
(n <- length(yi))
## [1] 50

hist(yi, breaks=seq(0,42,2))

  • An estimate for the total number of signatures is calculated as \(\hat{Y} = N\frac{\sum y_i}{n}\):
(Yhat <- N * sum(yi) / n)
## [1] 19887.92
  • The 80 pct. confidence limits can be calculated by hand as follows
Yhat + c(-1,1) * qnorm(0.9) * N * sd(yi) * sqrt(1-n/N) / sqrt(n) 
## [1] 18103.84 21672.00

  • This corresponds to the way the calculations are presented in Cochran…

  • An easier approach is to use the survey package written by Thomas Lumley (@tslumley)
library(survey)
  • In order to use survey, the data has to be in a data frame
sample.data <- data.frame(i=1:n, y=yi, N=N)
head(sample.data, n=5)
##   i  y   N
## 1 1 42 676
## 2 2 42 676
## 3 3 42 676
## 4 4 42 676
## 5 5 42 676

  • We now create a survey object using the function svydesign
  • The object is created by describing the survey design and pointing to the data
  • In our case the function requires a minimal number of arguments since the design is very simple
srs.design <- svydesign(id=~1,
                        fpc=~N,
                        data=sample.data)

  • There is a summary method for the class of survey objects
summary(srs.design)
## Independent Sampling design
## svydesign(id = ~1, fpc = ~N, data = sample.data)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07396 0.07396 0.07396 0.07396 0.07396 0.07396 
## Population size (PSUs): 676 
## Data variables:
## [1] "i" "y" "N"
  • This is very useful with more complicated designs

  • You can now estimate the total \(\hat{Y}\) in a very simple way – notice that the standard error is automatically reported as well
(Yhat <- svytotal(~y, srs.design))
##   total     SE
## y 19888 1392.1
  • The requested confidence limits can be derived using the standard error, but it is more convenient to use the function made specifically for this purpose
confint(Yhat, level=0.80)
##       10 %  90 %
## y 18103.84 21672
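
  • In the same way, the population mean discussed above can be estimated with svymean; a short sketch (assuming the cv helper from survey for the relative standard error):
(mu_hat <- svymean(~y, srs.design))
confint(mu_hat, level = 0.80)
cv(Yhat)   # coefficient of variation of the estimated total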