Why sampling?

  • In most cases a complete enumeration is not possible, but if the sampling is done properly, properties of a population can be inferred from a randomly selected sample

General approach

  • Overall planning of the survey
  • Definition of the population of interest
  • Construction of the sampling frame
  • Choice of sampling design
  • Selection of the sample
  • Data collection
  • Data editing and imputation
  • Estimation
  • Dissemination

Sampling error

  • The fact that not all units enter the inference introduces uncertainty about the estimates; this uncertainty is called the sampling error
  • The sampling error is a consequence of the fact that we would get (slightly) different results if we could repeatedly draw new samples
  • Given a sampling design, the sampling error will generally be lower if the units in the population are alike (low variance) or if the sample size is larger
  • Note: For the same sample size the sampling error will be smaller if a suitable design is chosen

Simple demonstration of sampling error

# synthetic population
set.seed(236542)
N <- 1000
yi <- rnorm(N, mean=5000, sd=1000)

# attributes at population level
mean(yi)
## [1] 4965.94

hist(yi)

# take a sample of size n
n <- 10
s <- sample(1:length(yi), n, replace = FALSE)

# mean of sample
mean(yi[s])
## [1] 4729.023
mean(yi)
## [1] 4965.94

# repeat sampling K times
K <- 1000
mu_hat <- numeric(K)

for (i in 1:K){
  s <- sample(1:length(yi), n, replace = FALSE)
  mu_hat[i] <- mean(yi[s])
}

mean(mu_hat)  
## [1] 4964.755

# sampling distribution
hist(mu_hat)  

  • Playing with population variance and sample size…
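  • A minimal sketch of this (the helper sim_se and the parameter values are ours, chosen for illustration): the empirical standard error of the sample mean shrinks when the population variance is lower or the sample size is larger:
sim_se <- function(pop_sd, n, N = 1000, K = 1000) {
  # synthetic population with the given spread
  yi <- rnorm(N, mean = 5000, sd = pop_sd)
  # empirical standard error of the sample mean over K repeated SRS draws
  mu_hat <- replicate(K, mean(yi[sample(seq_along(yi), n, replace = FALSE)]))
  sd(mu_hat)
}

set.seed(236542)
sim_se(pop_sd = 1000, n = 10)   # baseline
sim_se(pop_sd =  200, n = 10)   # lower population variance
sim_se(pop_sd = 1000, n = 100)  # larger sample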

Representativity

  • Representativity is not dependent upon the sample size
  • For a sample to be representative, all units must have the possibility to be in the sample (inclusion probability > 0)
  • The intuitive meaning of a representative sample is normally that it has a certain size, but also that the marginal properties of the sample resemble those of the population (a small sketch follows after this list)
    • Certain procedures exist that construct a balanced sample in exactly this sense
    • A balanced sample is also said to be well spread, and measures of spread have been constructed (see e.g. Grafström and Tillé)
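  • A small base-R sketch of the point on marginal properties (the region variable and all values are invented for illustration):
set.seed(1)
N <- 1000
region <- sample(c("North", "South"), N, replace = TRUE, prob = c(0.3, 0.7))

n <- 100
s <- sample(1:N, n, replace = FALSE)

prop.table(table(region))     # population marginal
prop.table(table(region[s]))  # sample marginal, roughly similar for an SRS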

SRS: Simple random sampling

  • SRS is one of the fundamental building blocks
  • Assume a population of size \(N\) and that a sampling frame (a list of units) is available
  • Randomly select \(n\) out of \(N\) units without replacement
  • The implied inclusion probability is \(\pi = n/N\)
  • Note: This is equivalent to considering the \(C(N,n)\) possible \(n\)-subsets and randomly selecting one of these as the sample!
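  • A minimal sketch of SRS with base R (the values of N and n are just for illustration):
N <- 1000
n <- 10
s <- sample(1:N, n, replace = FALSE)  # draw an SRS of n out of N units

n / N         # implied inclusion probability pi for every unit
choose(N, n)  # number of possible n-subsets C(N, n)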

SRS and the HT-estimator

  • Assume a population with \(N\) units and that the units have some quantitative attribute \(y\)
  • We seek to estimate the total sum of this attribute over the entire population: \[ \tau_y = \sum _{j=1}^N y_j\]
  • For this purpose we draw a sample of size \(n\) by simple random sampling (SRS)
  • The inclusion probability \(\pi\) for each unit is \(n/N\)

  • We can now estimate \(\tau_y\) by applying the Horvitz-Thompson estimator
  • Every single observation from the sample is weighted by the inverse of the inclusion probability: \[\hat{\tau}_y = \sum_{j=1}^n \frac{N}{n}y_j\]

  • The variance of \(\hat{\tau}_y\) is given by \[ V(\hat{\tau}_y) = N^2 (1-\frac{n}{N}) \frac{\hat{s}^2}{n} \] where \(\hat{s}^2\) is an estimate for the variance of \(y\) in the population \(V(y) = \sigma^2\)
  • The factor \((1-n/N)\) is the finite population correction — this quantity can be ignored when the sampling fraction \(n/N\) is very small
  • We normally estimate \(\hat{s}\) from the sample by the sample standard deviation \[ \hat{s} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2} \]

  • Normally we are more interested in the square root of the variance of the estimate — this quantity is called the standard error
  • The standard error is conveniently measured on the same scale as the estimate
  • We can also calculate the coefficient of variation (or relative standard error) \[ CV(\hat{\tau}) = \frac{s_{\hat{\tau}}}{\hat{\tau}} \]

  • Note: If instead we wanted to estimate the mean \(\mu\) of \(y\) in the population, then we simply divide by the population size \(N\), which yields \[ \hat{\mu}_y = \frac{1}{N} \hat{\tau}_y = \sum_{j=1}^n \frac{1}{n}y_j \] This is simply the sample mean
  • Consequently the variance for the estimated mean can be found as \[ V(\hat{\mu}_y) = V(\frac{1}{N}\hat{\tau}_y) = \frac{1}{N^2}V(\hat{\tau}_y) = (1-\frac{n}{N}) \frac{\hat{s}^2}{n} \]
  • Note that \(CV(\hat{\tau}) = CV(\hat{\mu}_y)\)
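  • As a sketch, the formulas above can be computed directly in R for an SRS drawn from the synthetic population used earlier (the variable names are ours, for illustration):
set.seed(236542)
N  <- 1000
yi <- rnorm(N, mean = 5000, sd = 1000)

n  <- 10
s  <- sample(1:N, n, replace = FALSE)
ys <- yi[s]

tau_hat <- N / n * sum(ys)                  # HT estimate of the total
v_hat   <- N^2 * (1 - n / N) * var(ys) / n  # estimated variance
se_hat  <- sqrt(v_hat)                      # standard error
cv_hat  <- se_hat / tau_hat                 # coefficient of variation

c(total = tau_hat, SE = se_hat, CV = cv_hat)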

Application of the HT-estimator in R

  • Example from Cochran (pp. 27-28)

  • We start by entering the data, which can be done in any number of ways
yi <- c(rep(42,23), rep(41,4), 36, 32, 29, rep(27,2), 23, 19, rep(16,2), 
        rep(15,2), 14, 11, 10, 9, 7, rep(6,3), rep(5,2), 4, 3)
  • We also need to declare \(N\) and \(n\): the former is hard-coded, while the latter is derived from the \(y_i\)’s
N <- 676
(n <- length(yi))
## [1] 50

hist(yi, breaks=seq(0,42,2))

  • An estimate for the total number of signatures is calculated as \(\hat{Y} = N\frac{\sum y_i}{n}\):
(Yhat <- N * sum(yi) / n)
## [1] 19887.92
  • The 80 pct. confidence limits can be calculated by hand as follows
Yhat + c(-1,1) * qnorm(0.9) * N * sd(yi) * sqrt(1-n/N) / sqrt(n) 
## [1] 18103.84 21672.00

  • This corresponds to the way the calculations are presented in Cochran…

  • An easier approach is to use the survey package written by Thomas Lumley (@tslumley)
library(survey)
  • In order to use survey, the data has to be in a data frame
sample.data <- data.frame(i=1:n, y=yi, N=N)
head(sample.data, n=5)
##   i  y   N
## 1 1 42 676
## 2 2 42 676
## 3 3 42 676
## 4 4 42 676
## 5 5 42 676

  • We now create a survey object using the function svydesign
  • The object is created by describing the survey design and pointing to the data
  • In our case the function requires a minimal number of arguments since the design is very simple
srs.design <- svydesign(id=~1,
                        fpc=~N,
                        data=sample.data)

  • There is a summary method for the class of survey objects
summary(srs.design)
## Independent Sampling design
## svydesign(id = ~1, fpc = ~N, data = sample.data)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07396 0.07396 0.07396 0.07396 0.07396 0.07396 
## Population size (PSUs): 676 
## Data variables:
## [1] "i" "y" "N"
  • This is very useful with more complicated designs

  • You can now estimate the total \(\hat{Y}\) in a very simple way – notice that the standard error is automatically reported as well
(Yhat <- svytotal(~y, srs.design))
##   total     SE
## y 19888 1392.1
  • The requested confidence limits can be derived using the standard error, but it is more convenient to use the function made specifically for this purpose
confint(Yhat, level=0.80)
##       10 %  90 %
## y 18103.84 21672
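
  • In the same way, the population mean discussed above can be estimated with svymean; a short sketch (assuming the cv helper from survey for the relative standard error):
(mu_hat <- svymean(~y, srs.design))
confint(mu_hat, level = 0.80)
cv(Yhat)   # coefficient of variation of the estimated total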