Misuse of data and breakdowns in communication have happened at every step of the coronavirus’ spread. It’s up to statisticians to better articulate the numbers.
In this two-part series written by University of Montana graduate students, the authors explore the role of data science in solving the coronavirus pandemic. You can find the first article here, which focuses on data science and health care during the pandemic.
Effective data communication is the hardest part of any data scientist’s job. Scott Berinato’s January 2019 article in the Harvard Business Review, “Data Science and the Art of Persuasion,” refers to it as the last mile because visualization and communication are generally the last steps in an analytical project.
When that job is communicating with executives about business strategy, it can be challenging enough. But getting hundreds of millions, if not billions, of people to understand the severity of this virus and the importance of social distancing, washing hands and restricting travel makes it doubly challenging.
Data scientists must communicate complex data in a way that is both easy to understand and faithful to its scientific rigor. According to Berinato, data is often misunderstood, misrepresented and misused because of three root causes that occur in the last mile: the “factory and the foreman,” the “convenient truth” and the “statistician’s curse.” The factory and the foreman arises when a person in a position of power, the foreman, has an idea or project that she wants to push through and asks the data scientist to create charts and analysis that back up the desired viewpoint. The convenient truth occurs when the information design team builds charts that oversimplify complex ideas, which leads decision-makers to draw the wrong conclusions from the analysis. And the statistician’s curse refers to statisticians spending too little time on visual communication and using language that confuses those without a stats-heavy background. All three scenarios have come into play as the coronavirus has engulfed the world.
Factory and the Foreman
Reporting confirmed coronavirus cases and deaths seems like it should be straightforward, but some common misconceptions have shaped the conversation around the pandemic. An overreliance on the numbers without acknowledgment of their flaws (including undercounting of the true number of cases) has caused people to falsely believe that the outbreak is less serious than it is.
For example, the days leading up to March 11, when Utah Jazz center Rudy Gobert tested positive for COVID-19, were marked by sports leagues pushing back against city ordinances that restricted large public gatherings. As public health officials urged teams not to allow fans to attend games, some initially refused to comply. At that point, the U.S. had reported only 1,080 confirmed cases and 31 deaths, but it also had a serious lack of testing capabilities. The state of Oklahoma, where the Jazz were slated to play on March 11, had a testing capacity of only 100 tests per day.
During this time, doctors were saying that without testing, there was no way to know how extensive the infection was or whom to isolate. Some leaders appeared to be more worried about their profits than the human toll; they relied on confirmed cases during these early stages rather than the insight of epidemiologists, who were trying to emphasize that the seriously low testing numbers made conclusions based on confirmed cases misleading. This example of the factory and the foreman issue likely set the U.S. back weeks in its fight to contain the pandemic.
Convenient Truth
Another way data has been misinterpreted is through a failure to understand how exponential growth works. From the tenth reported coronavirus death in the U.S. through April, deaths steadily doubled every three days. Under exponential growth, making decisions based solely on a snapshot in time, without factoring in that it will be completely outdated within days, can lead to terrible decisions.
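To make the doubling arithmetic concrete, here is a minimal Python sketch. It assumes an idealized curve that starts at 10 deaths and doubles exactly every three days, as described above; the function name and the 30-day window are illustrative choices, and the output is not a forecast or a record of actual counts.

```python
def projected_deaths(initial: float, days: float, doubling_days: float = 3.0) -> float:
    """Idealized exponential growth: the count doubles every `doubling_days` days."""
    return initial * 2 ** (days / doubling_days)


# Print the idealized count every six days for the first 30 days.
for day in range(0, 31, 6):
    print(f"day {day:2d}: {projected_deaths(10, day):,.0f}")

# day  0: 10
# day  6: 40
# day 12: 160
# day 18: 640
# day 24: 2,560
# day 30: 10,240
```

On this idealized curve, a figure quoted on day 24 is already a quarter of the total six days later, which is why a single snapshot ages so quickly.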
For example, on March 26, Rudy Giuliani tweeted, “Approximately, 7,500 people die every day in the United States. That’s approximately 645,000 people so far this year. Coronavirus has killed about 1,000 Americans this year. Just a little perspective.” Within days, that number of deaths had soared to well above 3,000, more than the number of people who perished in the Sept. 11 terrorist attacks. President Trump stated around the same time that between 100,000 and 240,000 people would die from the virus, a range whose upper end approaches the number of American soldiers killed in battle during World War II. In a four-day span, the number of possible deaths went from being downplayed to being likened to a war.
This example offers a clear illustration of the convenient truth: someone cherry-picked a statistic and came to an erroneous conclusion. When faced with an exponential growth curve, one cannot take a number out of the context of the overall trend. Until the exponential trend begins to level off, the current data snapshot does not reflect the situation.
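The gap between the snapshot and the trend can be checked with a few lines of Python. The sketch below reproduces the tweet’s baseline arithmetic and then asks how long a count of 1,000 takes to pass 3,000, assuming the three-day doubling time quoted earlier holds unchanged. That assumption is an idealization, and `days_until` is a hypothetical helper written for this illustration.

```python
import math
from datetime import date


def days_until(current: float, target: float, doubling_days: float = 3.0) -> float:
    """Days for `current` to reach `target` if it doubles every `doubling_days` days."""
    return doubling_days * math.log2(target / current)


# Baseline from the tweet: 7,500 deaths per day, counted through March 26, 2020.
day_of_year = (date(2020, 3, 26) - date(2020, 1, 1)).days + 1
baseline_deaths = 7_500 * day_of_year
print(day_of_year, baseline_deaths)  # 86 645000

# Time for the 1,000-death snapshot to triple under idealized three-day doubling.
print(round(days_until(1_000, 3_000), 1))  # 4.8
```

The baseline figure in the tweet checks out arithmetically. What it leaves out is that, at a three-day doubling time, the coronavirus figure it is compared against triples in under five days.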
Statistician’s Curse
Because of the virus’s long incubation period, asymptomatic carriers and a lack of historical data, coronavirus death tolls and total cases are extremely difficult to estimate. The models used to create these estimates rely on key assumptions.
One of the most influential assumptions is how seriously the public follows federal guidelines to stay at home and practice social distancing. Early models that assumed, as the White House Coronavirus Task Force did, that all federal guidelines would be followed closely predicted COVID-19 would kill between 100,000 and 240,000 Americans. On the other hand, experts who estimated that the guidelines were not being followed closely, and that those who were following them would slowly lessen their commitment, predicted between 263,000 and 1.7 million deaths.
Despite the huge difference in estimates, there’s a possibility that neither is wrong. Each depends on the behavior of the population. Where people take quarantining seriously, there’s evidence that reported cases are starting to plateau in hard-hit locations, and death estimates have dropped even further. This decline should be seen as a sliver of good news during bad times. However, because these estimates keep changing, some have misinterpreted the revisions as proof that these scientists don’t know what they’re doing. The failure of statisticians to clearly explain why the models change is a classic example of the statistician’s curse, and it can create doubt about appropriate actions and about the severity of the situation.
Whether because people want to believe flawed numbers, pick convenient truths or misinterpret model predictions, misuse of data and breakdowns in communication have happened every step of the way, putting millions of lives at risk. As data scientists continue to provide insights to guide decision-makers and the public through this difficult time, they must overcome the last-mile challenges so that they communicate the data effectively and so that the public, the media and government officials can accurately understand complicated results.
Photo by Markus Spiske on Unsplash.