Brady West - A Novel Methodology for Improving Applications of Modern Predictive Modeling Tools to Linked Data Sets Subject to Mismatch Error -JPSM MPSDS Seminar
views
comments
Brady T. West is a Research Professor in the Survey Methodology Program, located within the Survey Research Center at the Institute for Social Research on the University of Michigan-Ann Arbor (U-M) campus. He earned his PhD from the Michigan Program in Survey and Data Science in 2011. Before that, he received an MA in Applied Statistics from the U-M Statistics Department in 2002, being recognized as an Outstanding First-year Applied Masters student, and a BS in Statistics with Highest Honors and Highest Distinction from the U-M Statistics Department in 2001. His current research interests include the implications of measurement error in auxiliary variables and survey paradata for survey estimation, selection bias in surveys, responsive/adaptive survey design, interviewer effects, and multilevel regression models for clustered and longitudinal data. He is the lead author of a book comparing different statistical software packages in terms of their mixed-effects modeling procedures (Linear Mixed Models: A Practical Guide using Statistical Software, Third Edition, Chapman Hall/CRC Press, 2022), and he is a co-author of a second book entitled Applied Survey Data Analysis (with Steven Heeringa and Pat Berglund), the second edition of which was published by CRC Press in June 2017. He was elected as a Fellow of the American Statistical Association in 2022.
A Novel Methodology for Improving Applications
of Modern Predictive Modeling Tools to Linked Data Sets Subject to Mismatch
Error
In recent years, the rise of social media platforms such as Twitter/X has provided social scientists with a wealth of user-content data, and there has been renewed interest in the utility of administrative records for increasing survey efficiency. Combining social media data, administrative records, and survey data has the potential to produce a comprehensive source of information for social research. These data are often collected from multiple sources and combined by probabilistic record linkage. For the analysis of these linked data files, advanced machine learning techniques, such as random forests, boosting, and related ensemble methods, have become essential tools for survey methodologists and data scientists. There is, however, a potential pitfall in the widespread application of these techniques to linked data sets that needs more attention. Linkage errors such as mismatch and missed-match errors can distort the true relationships between variables and adversely alter the performance metrics routinely output by predictive modeling techniques, such as variable importance, confusion matrices, RMSE, etc. Thus, the actual predictive performance of these machine-learning techniques may not be realized. In this presentation, I will describe a new general methodology designed to adjust modern predictive modeling techniques for the presence of mismatch errors in linked data sets. The proposed approach, based on mixture modeling, is general enough to accommodate various predictive modeling techniques in a unified fashion. I evaluate the performance of the new methodology with simulations implemented in R. I will conclude with recommendations for future work in this area.