Introduction to model assisted estimation

Model assisted estimation

Sometimes we have auxiliary information available, ie. information \(x\) known for both the sampled units but also at the population level \(\tau_x\)
Trivial example: Export is a fixed ratio of the turnover \[ \hat{\tau}_y^{ratio} = \frac{\tau_x}{\hat{\tau}_x^{HT}}\hat{\tau}_y^{HT} \]

The regression estimate

The purpose of the regression estimate is to use an auxiliary variable \(x\) to get a better estimate of our variable of interest \(y\)
Better means either
- smaller standard error
- or reduction of bias
One can show that the regression etimate may be though of as an adjusted HT-estimate:

\[ \hat{\tau}_y^{GREG} = \hat{\tau}_y^{HT} + (\tau_x - \hat{\tau}_x)\hat{\beta} \]

Assume we know the totals of the auxiliary variable at the population level \[ \tau_x = \sum_{i \in \mathcal{U}}x_i \]

We could also estimate these quantities from the sample applying the HT-estimator \[ \hat{\tau}_x^{HT}=\sum_{i \in \mathcal{S}}d_i x_i \] where \(d_i\) is the design weight for sample unit \(i\), ie. \(\pi _i^{-1}\)

Hopefully \(\hat{\tau}_x^{HT} \approx \tau_x\) but generally the two quantities will not be exactly equal

In principle, this is unimportant, since we already know the true value
But since the design weight \(d\) is also used for estimating totals for our variables of interest \(y\), we must expect these to be a little off, too – especially if \(y\) and \(x\) are correlated
Hence, we seek a calibrated weight \(w_i\) such that the following calibration equation is satisfied \[ \sum_{i \in \mathcal{S}} w_i x_i = \tau_x \]

The calibrated weights should ideally be close to the original design weights, so the calibrated weights are found as the solution to a minimization problem: \[ \min \sum_{i \in \mathcal{S}} G(w_i,d_i) \quad \textrm{subject to} \quad \sum_{i \in \mathcal{S}} w_i x_i = \tau_x \]

\(G\) is a function measuring the distance between the design weight and the calibrated weight – for the regression estimate we use a quadratic distance function \[ G(w,d) = \frac{(w-d)^2}{2d} \]

Now the regression estimate for our variable of interest \(y\) can simply be calculated as a weighted sum \[ \hat{\tau}_y^{GREG} = \sum_{i \in \mathcal{S}} w_i y_i \]

The calibrated weight can be regarded as the design weight \(d\) multiplied with a \(g\)-weight (a factor introducing individual corrections) \[ w_i = g_i d_i \Rightarrow g_i = \frac{w_i}{d_i} \]

Always check the \(g\)-weights – if there are extreme values check again what you are trying to accomplish with the calibration equations

A third (and final) interpretation of the GREG-estimate (giving more prominence to an underlying linear model) is as the sum of predicted (or fitted) values for the entire population plus the design weighted residuals \[ \hat{\tau}_y^{GREG} = \sum_{i \in \mathcal{U}} \hat{y}_i + \sum_{i \in \mathcal{S}} d_i (y_i - \hat{y}_i) \]

Calculating \(V(\hat{\tau}_y^{GREG})\) involves approximation by Taylor expansion and is generally best left to some suitable package in R…

The regression estimate with R

We create a small (\(N=10\)) and very simple population with two variables and no strata
Variable \(x\) is an auxiliary variable available from a register (say land area measured in hectare), while \(y\) is our main variable of interest only available from a sample (say crop production measured in ton)

my.pop <- data.frame(id=1:10,
                      x=c(99,111,114,95,110,102,94,83,96,102),
                      y=c(647,654,674,637,664,653,644,625,654,647))

There is a strong positive correlation between \(x\) and \(y\) (\(\rho =\) 0.9032489)

Now we draw a simple random sample (SRS) of size 3 and estimate both \(\tau_x\) and \(\tau_y\) using the survey package:

s <- c(1,3,8)
my.sample <- my.pop[s,]
my.sample$N <- nrow(my.pop)

library(survey)
srs.svyobj <- svydesign(ids = ~1,
                       fpc = ~N,
                       data = my.sample)

(y.hat <- svytotal (~y, srs.svyobj))

##    total     SE
## y 6486.7 118.55

(x.hat <- svytotal (~x, srs.svyobj))

##    total     SE
## x 986.67 74.885

sum(my.pop$x)

## [1] 1006

We now turn to calibration (the regression estimate) and start by setting up a target consisting of the known population size and the known total of \(x\)

(cgoal <- c('(Intercept)'=nrow(my.pop), 'x'=sum(my.pop$x)))

## (Intercept)           x 
##          10        1006

Don’t worry too much about using named vectors – it is just syntax…

The calibration is done by an aptly named function

cal.svyobj <- calibrate (srs.svyobj,
                         formula=~x,
                         population=cgoal)

We use the new survey object to calculate a regression estimate of \(\tau_x\), although the result is known by definition in advance

(x.hat.cal <- svytotal (~x, cal.svyobj))

##   total SE
## x  1006  0

More interesting, we calculate the regression estimate of \(y\)

(y.hat.cal <- svytotal (~y, cal.svyobj))

##    total     SE
## y 6517.2 9.2829

Remember that the true (but principally unknown) value of \(\tau_y\) is 6499 – in this case the power from the correlation between \(x\) and \(y\) is used to reduce the standard error rather than the bias

We can get the calibrated weight (and also the \(g\)-weights) to gain some insight

my.sample$dw <- nrow(my.pop) / nrow(my.sample)
my.sample$cw <- weights(cal.svyobj)
my.sample$gw <- my.sample$cw / my.sample$dw

my.sample

##   id   x   y  N       dw       cw       gw
## 1  1  99 647 10 3.333333 3.346741 1.004022
## 3  3 114 674 10 3.333333 3.950069 1.185021
## 8  8  83 625 10 3.333333 2.703190 0.810957

\(\tau_x\) was estimated slightly smaller than the target value with the HT-estimator – therefore \(g>1\) for the largest unit (in terms of \(x\)) while \(g<1\) for the smallest unit

Also note the following proporties

sum(my.sample$dw)

## [1] 10

sum(my.sample$cw)

## [1] 10

mean(my.sample$gw)

## [1] 1

Now imagine that an updated frame population shows that the true number of units is 12 rather than 10, and that the true total value of x is 1100 hectares rather than 1006
Adjusting the target to these figures allows us to raise the sample to the level of the updated frame population

cgoal.new <- c('(Intercept)'=12,'x'=1100)

cal.svyobj.new <- calibrate (srs.svyobj, formula=~x, 
                   population=cgoal.new)
(x.hat.cal.new <- svytotal (~x, cal.svyobj.new))

##   total SE
## x  1100  0

(y.hat.cal.new <- svytotal (~y, cal.svyobj.new))

##    total     SE
## y 7651.4 11.595

cv(y.hat.cal.new)

##             y
## y 0.001515457

The result of using the target from the updated frame population that the estimate of \(\tau_y\) has been increased significantly
Notice also that the CV is also higher –- raising the estimates relative to the original frame comes at a cost

Finally we may derive and inspect the new correction weights.

##   id   x   y  N       dw       cw       gw      cw2       gw2
## 1  1  99 647 10 3.333333 3.346741 1.004022 3.941748 1.1825243
## 3  3 114 674 10 3.333333 3.950069 1.185021 1.320388 0.3961165
## 8  8  83 625 10 3.333333 2.703190 0.810957 6.737864 2.0213592

sum(my.sample$cw2)

## [1] 12

mean(my.sample$gw2)

## [1] 1.2