February 16, 2022
By Brendan J. Kelly, MD, MS
A recently published report from Bradley and colleagues in Nature presents an analysis of discrepancies between two large surveys of early COVID-19 vaccine uptake and gold-standard benchmark data subsequently reported by the Centers for Disease Control and Prevention.
The authors found that the two large surveys of U.S. adults (Delphi-Facebook and Census Household Pulse) overestimated early COVID-19 vaccine uptake by as much as 17 percentage points. They contrasted this poor performance with that of a much smaller survey (the Axios-Ipsos Coronavirus Tracker), which tracked the CDC benchmark more closely. They highlight “the big data paradox”: conventional formulas for statistical uncertainty mislead when applied to surveys with systematic sampling bias, because as sample size increases, the bias comes to dominate the estimator error even as the reported confidence intervals keep shrinking.
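To make the paradox concrete, here is a minimal simulation sketch; the population size, true uptake rate, and response probabilities below are invented for illustration and are not taken from the report. It shows how a survey whose respondents are even slightly more likely to be vaccinated can report a razor-thin margin of error around a badly biased estimate, while a genuinely random sample of 1,000 is less precise but close to the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 5 million adults, 55% truly vaccinated.
# All numbers are illustrative only; they are not from Bradley et al.
N = 5_000_000
true_rate = 0.55
vaccinated = rng.random(N) < true_rate

# Response mechanism with a small defect: vaccinated people are slightly
# more likely to answer, so responding is correlated with the outcome.
p_respond = np.where(vaccinated, 0.012, 0.008)
responded = rng.random(N) < p_respond

big_sample = vaccinated[responded]
n_big = big_sample.size
est_big = big_sample.mean()
moe_big = 1.96 * np.sqrt(est_big * (1 - est_big) / n_big)  # naive 95% margin of error

# For contrast: a small, genuinely random sample from the same population.
small_sample = rng.choice(vaccinated, size=1_000, replace=False)
est_small = small_sample.mean()
moe_small = 1.96 * np.sqrt(est_small * (1 - est_small) / 1_000)

print(f"true uptake:    {true_rate:.3f}")
print(f"biased survey:  n={n_big:,}  estimate={est_big:.3f} +/- {moe_big:.4f}")
print(f"random sample:  n=1,000   estimate={est_small:.3f} +/- {moe_small:.4f}")
```

In a typical run, the large biased survey produces a far tighter interval than the small random sample, yet only the small sample's interval comes anywhere near the true rate; precision growing around a fixed bias is exactly the failure mode the authors describe.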
In their report, the authors propose a novel framework for quantifying survey data quality that decomposes estimator error into three components: the “data scarcity” (the survey sampling error conventionally reported), a “data quality defect” (the correlation between whether an individual’s response is recorded and the response itself), and the “inherent problem difficulty” (the heterogeneity of the outcome across the population). They found that the errors in the two large surveys were dominated by growing data quality defects. Although the raw weekly sample size of the largest survey was 250,000, the authors found that its bias-adjusted effective sample size was less than 10. Comparing the survey methods, the authors found no single factor that drove the bias: panel recruitment, sampling, and weighting of the survey data all contributed.
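For readers interested in the arithmetic behind these three components, the framework builds on an error decomposition introduced by Meng in earlier work, which the Nature report adapts. A sketch of the identity, in my own notation rather than the paper's, is below: Y is the outcome of interest (vaccination status), R the indicator that a response is recorded, n and N the sample and population sizes.

```latex
% Sketch of the error decomposition for a sample mean versus the population
% mean (notation mine; requires amsmath for \text).
\[
\underbrace{\bar{Y}_{n} - \bar{Y}_{N}}_{\text{survey error}}
  \;=\;
\underbrace{\hat{\rho}_{R,Y}}_{\text{data quality defect}}
  \times
\underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data scarcity}}
  \times
\underbrace{\sigma_{Y}}_{\text{problem difficulty}}
\]
```

The bias-adjusted effective sample size the authors report is, roughly, the size of a simple random sample whose expected error would match this quantity. Because the defect correlation does not shrink as n grows, even a very small correlation between responding and the response can collapse an effective sample size of hundreds of thousands into single digits.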
They conclude with a statement of caution for the big data era, urging clinicians, scientists, and policymakers to recognize that large sample sizes can exacerbate the effects of small biases in data collection and lead to incorrect inferences.
(Bradley et al. Nature. 2021;600:695-700.)