Richard Valliant - The Evolution of the Use of Models in Survey Sampling - February 15, 2023
Richard Valliant, PhD, is a research professor emeritus at the Institute for Social Research, University of Michigan, and at the Joint Program in Survey Methodology at the University of Maryland. He is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and has been an associate editor of the Journal of the American Statistical Association, Journal of Official Statistics, and Survey Methodology.
The Evolution of the Use of Models in Survey Sampling
The use of models in survey estimation has evolved over the last five (or more) decades. This talk will trace some of the developments over time and attempt to review some of the history. Consideration of models for estimating descriptive statistics began as early as the 1940s, when Cochran and Jessen proposed linear regression estimators of means. These were early examples of model-assisted estimation, since the properties of the Cochran-Jessen estimators were calculated with respect to a random sampling distribution. Model thinking was used informally through the 1960s to form ratio and linear regression estimators that could, in some applications, reduce design variances.
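For reference, the classical forms of these two estimators of a finite population mean, written here in generic notation that is not taken from the talk, are
\[
  \hat{\bar{Y}}_{R} = \frac{\bar{y}_s}{\bar{x}_s}\,\bar{X},
  \qquad
  \hat{\bar{Y}}_{\mathrm{reg}} = \bar{y}_s + \hat{\beta}\,(\bar{X} - \bar{x}_s),
\]
where \(\bar{y}_s\) and \(\bar{x}_s\) are sample means, \(\bar{X}\) is the known population mean of an auxiliary variable \(x\), and \(\hat{\beta}\) is the slope estimated by regressing \(y\) on \(x\) in the sample. In the design-based treatment, the bias and variance of both estimators are computed with respect to the random sampling distribution.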
In a 1963 Australian Journal of Statistics paper, Brewer presented results for a ratio estimator that were based entirely on a superpopulation model. Royall (Biometrika 1970 and later papers) formalized the theory for a more general prediction approach using linear models. Since that time, the use of models has become ubiquitous in the survey estimation literature and has been extended to nonparametric, empirical likelihood, Bayesian, small area, machine learning, and other approaches. There remains a considerable gap between the more advanced techniques in the literature and the methods commonly used in practice.
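A minimal sketch of the prediction approach that Royall formalized, assuming a simple working linear model (the notation is illustrative rather than taken from the talk): under a superpopulation model such as
\[
  E_M(y_i) = \beta x_i, \qquad \operatorname{Var}_M(y_i) = \sigma^2 v_i,
\]
the finite population total is predicted by
\[
  \hat{T} = \sum_{i \in s} y_i + \hat{\beta} \sum_{i \in U \setminus s} x_i,
\]
where \(s\) is the set of sampled units and \(U\) the full population. Bias and variance are then evaluated with respect to the model \(M\) rather than the randomization distribution, which is the key departure from the design-based approach.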
In parallel with these model developments, the design-based, randomization approach came to dominate official statistics in the US, largely due to the efforts of Morris Hansen and his colleagues at the US Census Bureau. In 1937, Hansen and others at the Census Bureau designed a follow-on sample survey to a special census of the employed and partially employed because response to the census was incomplete and believed to be inaccurate. The sample estimates were judged to be more trustworthy than those of the census itself. This began Hansen’s career-long devotion to random sampling as the only trustworthy method for obtaining samples from finite populations and for making inferences.
Model-assisted estimation, as discussed in the 1992 book by Särndal, Swensson, and Wretman, is a type of compromise in which models are used to construct estimators while a randomization distribution is used to compute properties like means and variances. This thinking has led to the popularity of doubly robust approaches, where the goal is to have estimators with good properties with respect to both a randomization and a model distribution.
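A sketch of the model-assisted idea in its common generalized regression (GREG) form, again in illustrative notation not drawn from the talk:
\[
  \hat{T}_{\mathrm{GREG}} = \sum_{i \in U} \hat{y}_i + \sum_{i \in s} \frac{y_i - \hat{y}_i}{\pi_i},
\]
where \(\hat{y}_i\) are predictions from a working model fit to the sample, \(\pi_i\) are the inclusion probabilities, \(s\) is the sample, and \(U\) is the population. The estimator is approximately design-unbiased whether or not the working model holds, while a well-chosen model shrinks the residuals \(y_i - \hat{y}_i\) and hence the design variance.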
The field has now reached a troubling crossroads in which response rates to many types of surveys have plummeted and nonprobability datasets are touted as a way of obtaining reasonable-quality data at low cost. Sophisticated model-based mathematical methods have been developed for estimation from nonprobability samples. In some applications, e.g., administrative data files that are incomplete due to late reporting, these methods may work well. In others, however, the quality of nonprobability sample data is irremediably bad, as illustrated by Kennedy in her 2022 Hansen lecture. In some situations, we are back where Hansen was in 1937: standard approaches no longer work. Methods are needed to evaluate whether acceptable estimates can be made from the most suspect datasets. Nonetheless, nonprobability datasets are readily available now, and it is up to the statistical profession to develop good methods for using them.
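One widely used class of such methods, offered here only as an illustration and not necessarily the one discussed in the talk, is quasi-randomization weighting, in which pseudo-inclusion probabilities are estimated for the units of a nonprobability sample \(s_{np}\):
\[
  \hat{T} = \sum_{i \in s_{np}} \frac{y_i}{\hat{p}_i},
\]
where \(\hat{p}_i\) is an estimated probability of appearing in \(s_{np}\), typically obtained from a participation model fit by combining the nonprobability sample with a reference probability sample or population control data. The resulting estimates are only as good as that model and the covariates available to it, which is precisely the evaluation problem raised above.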