Vaccination Surveys Fell Victim to the “Big Data Paradox”

The tiny margins of error from big data sets boosted trust in the polls’ accuracy, but when the CDC disclosed actual immunization rates, two surveys were off by a lot.

In the spring of 2021, Delphi-Facebook and the U.S. Census Bureau used weekly surveys to get near-real-time estimates of how many people had received a COVID-19 vaccine. They used responses from as many as 250,000 people.

With so many data points, the surveys’ reported margins of error were very small, which made people more confident that the numbers were correct.

However, when the Centers for Disease Control and Prevention released information on actual reported vaccination rates, the two surveys were significantly off.

The Delphi-Facebook survey overestimated vaccine uptake by 17 percentage points by the end of May (70% versus 53%, according to the CDC), and the Census Bureau’s Household Pulse Survey overestimated it by 14 percentage points.

A comparative analysis by statisticians and political scientists from Harvard, Oxford, and Stanford universities concludes that the surveys fell victim to a mathematical phenomenon known as the “Big Data Paradox”: the tendency of big data sets to minimize one type of error, the random error that comes from small samples, while magnifying another that tends to get less attention, namely errors due to systematic biases that make the sample poorly represent the larger population.
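To make the first half of the paradox concrete, here is a minimal Python sketch of the textbook margin-of-error formula for a proportion, applied to the three sample sizes mentioned in this article. It assumes simple random sampling, which is exactly the assumption the big surveys violated; the function name is illustrative and is not taken from the study.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Nominal 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes mentioned in the article; p = 0.5 is the worst case for sampling error.
for n in (1_000, 75_000, 250_000):
    moe_points = 100 * margin_of_error(0.5, n)
    print(f"n = {n:>7,}: +/- {moe_points:.2f} percentage points")

# The formula only measures random sampling error. It is silent about bias
# from who ends up in the sample, and that bias does not shrink as n grows.
```

The nominal margin falls from about 3 points at 1,000 respondents to about 0.2 points at 250,000, which is why a 17-point miss looks impossible if only sampling error is considered.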

Xiao-Li Meng, a Harvard statistician and the Whipple V.N. Jones Professor of Statistics, coined the term “Big Data Paradox” in his 2018 analysis of polls during the 2016 presidential election. Those surveys, which predicted a Clinton presidency, were distorted by nonresponse bias: in this case, the tendency of Trump supporters to either not respond or describe themselves as “undecided.”

Faced with the paradox, Meng added, researchers have to acknowledge that a big data survey does not give them the answer if it is biased. A large sample size can hide bias, giving researchers and later users of survey results a false sense of certainty, as was the case in the 2016 election.

“This is the Big Data Paradox: the larger the data size, the surer we fool ourselves when we fail to account for bias in data collection,” said the authors of the study.

The authors warn that acting on such results can be very damaging. A reported COVID vaccination rate of 70%, for example, could lead a governor to relax public health regulations. If the actual rate is closer to 55%, the move could lead to an increase in cases and COVID deaths instead of fostering a return to normal life.

“All around the world, policymakers and scientific advisors are trying to make sense of COVID data,” said corresponding author Seth Flaxman.

“Reported cases are a fraction of true infections, COVID-19 attributed deaths are a severe undercount of the true toll of this pandemic, and electronic medical records do not give us the full picture of long COVID. When it comes to survey data, all sorts of data quality issues, such as vaccinated respondents being more likely to respond to surveys and marginalized groups being underrepresented, can lead to incorrect estimates.”

Although it is well established that survey accuracy depends on both data quantity and data quality, quantity has recently taken the spotlight as technology has substantially expanded our ability to collect and handle enormous data sets. These data sets potentially offer never-before-seen insights, particularly into subpopulations that were previously difficult to study. But if data quality is not prioritized, either by ensuring the sample is representative of the larger population or by understanding how it differs so results can be adjusted, the results can be misleading.

“There’s this drive to get the biggest data sets possible and modern technology, big data, has made that possible,” said Shiro Kuriwaki, a co-first author of the study.

“What that allows is analysis at a more granular level than ever before, but we need to be mindful that biases in the data get worse with bigger sample size, and that can carry right to the subgroups.”

Meng said he first considered the difficulties posed by big data a decade ago, during a visit to Harvard by a U.S. Census Bureau official. The official met with a group of statisticians and asked how to handle the enormous data sets, covering substantial percentages of the U.S. population, that were becoming available. Using the IRS as an example, he asked whether statisticians would prefer a sample covering 5% of the population that they knew was representative of the entire population, or IRS data that covered 80% of the population but that they weren’t sure was representative. The statisticians chose the 5%. “What if it was 90%?” the Census Bureau official asked. Even then, the statisticians chose the 5%, because with a sample they understood, their answer would be more accurate than one from a much bigger sample with unknown biases.
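The statisticians’ choice can be illustrated with a small simulation. The Python sketch below builds a synthetic population, then compares a 5% simple random sample with a sample covering roughly 80% of the population in which vaccinated people are somewhat more likely to be included. The population size and the inclusion probabilities are assumptions made up for illustration; they do not come from the study or from IRS data.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N = 200_000        # hypothetical population size
TRUE_RATE = 0.53   # share vaccinated, set to the CDC figure quoted in the article
population = [1 if random.random() < TRUE_RATE else 0 for _ in range(N)]
true_mean = sum(population) / N

# Option A: a simple random sample covering 5% of the population.
small_sample = random.sample(population, N // 20)
small_estimate = sum(small_sample) / len(small_sample)

# Option B: a sample covering roughly 80% of the population, but with
# vaccinated people (1) more likely to be included than unvaccinated (0).
# These inclusion probabilities are invented for illustration.
P_INCLUDE = {1: 0.88, 0: 0.71}
big_sample = [y for y in population if random.random() < P_INCLUDE[y]]
big_estimate = sum(big_sample) / len(big_sample)

print(f"true rate:          {true_mean:.3f}")
print(f"5% random sample:   {small_estimate:.3f} (n = {len(small_sample):,})")
print(f"~80% biased sample: {big_estimate:.3f} (n = {len(big_sample):,})")
```

Under these assumptions the 5% sample lands close to the true rate, while the sample sixteen times its size comes out roughly five percentage points too high, and collecting even more data the same way would not fix it.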

“Every data set is going to have certain quirks, but the question is whether the quirk matters to whatever your problem is,” Meng said.

“Social media has tons of data just sitting there. And they may think they have a public sample, but may not realize that their population is biased to start.”

Nonresponse bias, in fact, persists even when survey researchers are aware of its dangers. For example, although pollsters introduced new methodology in the aftermath of 2016, a 2020 essay by Kuriwaki and another coauthor of the present study, Harvard freshman Michael Isakov, correctly forecast overconfidence in 2020 presidential election polls.

“In the current paper, we found that while both the Delphi-Facebook and Census Bureau researchers attempted to account for potential issues, their corrections were simply not enough to alleviate all of the bias,” Isakov added.

The research, undertaken in collaboration with Dino Sejdinovic at Oxford, highlights areas of potential bias in vaccination polls. The Delphi-Facebook respondents were drawn from daily Facebook users, but the survey did not adjust for factors such as education level (two in ten respondents did not have a college degree, compared to four in ten of all U.S. adults) or race and ethnicity (the proportion of Black and Asian respondents was only half that of the general population). The Census Bureau survey adjusted for education and race/ethnicity, but neither poll collected information on respondents’ partisanship, which could have played a role in vaccine uptake. Furthermore, neither survey adjusted its sample to reflect the distribution of urban and rural areas, which the authors suggested could have been a factor.
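Adjusting for a factor such as education usually means reweighting respondents so that each group counts in proportion to its share of the population. The Python sketch below shows that idea using the education shares quoted above. The vaccination rates within each group are hypothetical numbers invented for illustration, and the function is a generic post-stratification sketch, not the weighting procedure used by either survey.

```python
def weighted_rate(group_rates, shares):
    """Average the per-group rates, weighting each group by its share."""
    return sum(group_rates[g] * shares[g] for g in group_rates)

# Shares from the article: two in ten respondents lacked a college degree,
# versus four in ten U.S. adults.
sample_shares = {"no_degree": 0.2, "degree": 0.8}
population_shares = {"no_degree": 0.4, "degree": 0.6}

# Hypothetical vaccination rates among respondents in each group
# (invented for illustration, not taken from the study).
group_rates = {"no_degree": 0.50, "degree": 0.75}

raw = weighted_rate(group_rates, sample_shares)           # what the unadjusted sample reports
adjusted = weighted_rate(group_rates, population_shares)  # reweighted to population shares

print(f"unadjusted estimate: {raw:.2f}")
print(f"adjusted estimate:   {adjusted:.2f}")
```

With these made-up rates, reweighting moves the estimate from 0.70 to 0.65. It corrects only the imbalance in education; if respondents within each group still differ from non-respondents, some bias remains.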

“The U.S. government is spending billions of dollars this year doing targeted outreach to try to get people who are not vaccinated, vaccinated,” added Valerie Bradley, first author of the paper.

“And if you are guiding that based on the Census Household Pulse or Facebook survey, you might be pouring literally billions of dollars into the wrong communities.”

In comparison, Axios-Ipsos researchers conducting a more traditional survey with only 1,000 respondents took great care to ensure the sample was representative of the larger population. They took into account education, race, ethnicity, and political affiliation, and even provided “offline” respondents with internet-connected tablets to ensure their opinions were recorded. Despite the smaller sample size, the Axios-Ipsos estimates of vaccine uptake were close to the CDC’s reported numbers of immunized people.

According to the authors, the ultimate effect of uncorrected bias in large polls shows up in the effective sample size, the size of a truly random sample that would be equally accurate. The Delphi-Facebook poll, despite surveying 250,000 respondents, had a bias-adjusted effective sample size of less than 10 in April 2021, a 99.99 percent reduction from its raw average weekly sample size. Similarly, the effective sample size of the Census Household Pulse, which received 75,000 responses weekly, was reduced by 99 percent in May 2021.
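A back-of-the-envelope version of this calculation asks how large a simple random sample would need to be for its typical sampling error to match the error actually observed. The Python sketch below applies that idea to the end-of-May figures quoted earlier in this article. It is a rough approximation that ignores the finite-population correction, not the authors’ exact method, and their published figures refer to specific weeks, so the numbers will not match theirs exactly.

```python
def effective_sample_size(true_rate, error):
    """Size of a simple random sample whose sampling variance equals error**2.

    Rough approximation: ignores the finite-population correction.
    """
    return true_rate * (1 - true_rate) / error ** 2

cdc_rate = 0.53  # CDC benchmark quoted in the article (end of May)

# Observed errors quoted in the article, in proportion units.
delphi = effective_sample_size(cdc_rate, 0.17)  # Delphi-Facebook, 17 points off
pulse = effective_sample_size(cdc_rate, 0.14)   # Household Pulse, 14 points off

print(f"Delphi-Facebook: about {delphi:.1f} respondents")
print(f"Household Pulse: about {pulse:.1f} respondents")
```

Even this crude version gives effective sample sizes of roughly 9 and 13, a tiny fraction of the raw weekly response counts.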

“If you have the resources, invest in data quality far more than you invest in data quantity,” Meng said.

“Bad quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data. Given the lengths the CDC goes to keep track of how many vaccines are given, we don’t need to rely on survey data to estimate overall vaccination rates. But when it comes to behavior, to which groups have been vaccinated, to hesitancy and barriers to access, accurate surveys are important. As adult vaccine uptake continues to increase, approaching 85 percent with a first dose in the U.S., a little humility about the limits of our knowledge is in order. But we can be sure of one thing: three out of 20 adults in the U.S. have no protection from vaccines, and we need to redouble our efforts to reach them.”

Source: 10.1038/s41586-021-04198-4
