Digital Disease Detection: Tracking Influenza with…Wikipedia?

Hello loyal Disease Daily readers! I am here to tell you about a new method of influenza-like illness surveillance, using a digital data source. Another ILI (“ILI” is an acronym for influenza-like illness, for any newcomers to the site) surveillance system? Yes – another one. Currently, there are quite a few surveillance systems that do a good job of estimating the number of people in the United States experiencing ILI, such as the systems run by the Centers for Disease Control and Prevention (CDC) and Google Flu Trends (GFT). At HealthMap, though, we were interested in using a new type of data – one that was completely free, and that anyone could access any time they wanted. This is what led me to Wikipedia.

And no – I don’t mean I went to Wikipedia and started searching for articles on ways to get free data to estimate ILI. Instead, we looked at the number of times specific Wikipedia articles were viewed each day, to see whether these data followed the trends in ILI prevalence. We looked at articles that were related to influenza (Influenza, H1N1, Tamiflu), some that were less specific but still related to health (Fever, Epidemic, Vaccine), and also the Wikipedia Main Page, which served as a measure of the regular background level of Wikipedia activity. For the 32 Wikipedia articles we chose to investigate, we used a simple tool created by a Wikipedia user to figure out how many times each article was viewed, every day, between December 10, 2007 (the first date for which these data are available) and August 19, 2013. Once we gathered it all up, we created a few models to see how well these data could estimate ILI activity in the United States.

Our model estimated the proportion of individuals in the United States who were experiencing ILI, and was built as a Poisson-family generalized linear model with a log link function. That sentence might sound completely foreign to some people, but the takeaway is that we used a Poisson model, which is a modeling approach typically used for count data – for example, the number of accidents recorded at 100 different intersections. We developed two models. In one model, we included all 32 Wikipedia articles of interest, and in the other we included only the subset of articles that best described the ILI estimates from the CDC.
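For readers who like to see the moving parts, here is a minimal sketch of what a Poisson model with a log link does. It is not our actual code: it uses a single predictor instead of 32 articles, the function name `fit_poisson_glm` is made up for this post, and the numbers are invented purely for illustration. The model assumes log(expected count) = b0 + b1 * x and finds b0 and b1 by Newton-Raphson.

```python
import math

def fit_poisson_glm(x, y, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson (one predictor only)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]  # fitted means under the log link
        # Gradient of the Poisson log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 Fisher information matrix
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Invented toy data: weekly page views (in thousands) and weekly ILI case counts.
page_views = [1.0, 1.5, 2.0, 3.0, 4.0]
ili_counts = [12, 15, 22, 41, 70]
b0, b1 = fit_poisson_glm(page_views, ili_counts)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

In practice you would hand this job to a statistics package rather than write the fitting loop yourself; the point here is only to show that the model predicts exp(b0 + b1 * x), so the estimates can never go negative.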

So – how did we do? What can Wikipedia tell us about ILI activity?

We compared our ILI estimates to those from official CDC data. We used CDC data as a “gold standard,” or something that we know to be of excellent quality, to tell us how our estimates fared in comparison. We found that our estimates were within 0.27 percent, on average, of the ILI value that the CDC reported. For comparison: over the same time period, Google Flu Trends estimated ILI values that had an average difference of 0.42 percent compared to CDC data.
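If you are wondering how an "average difference" like this can be computed, one common way is the mean absolute difference between the estimates and the reference values. The sketch below assumes that definition; the helper name `mean_absolute_difference` and the sample numbers are made up for illustration and are not our study data.

```python
def mean_absolute_difference(estimates, reference):
    """Average of |estimate - reference| over paired weekly values."""
    if len(estimates) != len(reference):
        raise ValueError("both series must cover the same weeks")
    return sum(abs(e - r) for e, r in zip(estimates, reference)) / len(estimates)

# Invented weekly ILI percentages: model estimates vs. a reference series.
model_ili = [2.0, 3.5, 1.0]
reference_ili = [2.5, 3.0, 1.5]
print(mean_absolute_difference(model_ili, reference_ili))
```

A smaller value means the estimates track the reference more closely, which is how the 0.27 and 0.42 figures above should be read.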

When it came to accurately estimating which week of the year had the highest level of ILI activity, our Wikipedia model was correct 17 percent more often than GFT estimates, with three out of six influenza seasons correctly identified.
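A peak-week comparison like this one can be sketched in a few lines: find the week with the highest value in each season for both series, then count how often they agree. The function names and the tiny example seasons below are hypothetical, and this is only one plausible way to score peak timing, not necessarily the exact procedure we used.

```python
def peak_week(weekly_values):
    """Return the 0-based index of the week with the highest value (first one on ties)."""
    return max(range(len(weekly_values)), key=weekly_values.__getitem__)

def peak_match_rate(model_seasons, reference_seasons):
    """Fraction of seasons in which the model's peak week equals the reference peak week."""
    matches = sum(
        peak_week(m) == peak_week(r)
        for m, r in zip(model_seasons, reference_seasons)
    )
    return matches / len(reference_seasons)

# Two invented three-week "seasons": the first peaks agree, the second do not.
model = [[1, 3, 2], [2, 1, 0]]
reference = [[0, 5, 1], [0, 4, 1]]
print(peak_match_rate(model, reference))
```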

Now, you may have heard in recent articles and posts that Google Flu Trends’ accuracy has been questioned. While Google Flu Trends has historically been quite accurate in its estimates of ILI, it got a little confused during the abnormally severe flu season of 2012-2013, and in 2009 because of the media hype surrounding the swine flu (H1N1) pandemic. Because these events spur the public to search the Internet for flu information, Google’s system estimated that many more people were sick than actually were. This can be a difficult problem to overcome. Several of the Wikipedia articles that we observed (such as Influenza) also showed very large spikes in user activity, especially during pH1N1. Our model was able to deal with these spikes by looking both at articles that had these spikes in views and at those that did not. By looking at both, we could determine that those big spikes in activity weren’t really people who were sick, but more likely people just seeking information. Regardless – it’s almost impossible to tell if someone online, whether using Google or reading articles on Wikipedia, actually has the flu, or if they are simply curious about influenza. This is one of the big hurdles in Digital Disease Detection.

So – what have we learned? Not only has our Wikipedia model been shown to be accurate in its estimates of ILI, and quick in producing them (in theory, new estimates could be generated every hour – whereas the CDC reports can take up to two weeks), but just as importantly, it was done with data that is completely free for everyone to use. Since Wikipedia makes its data available for free, anyone who is interested can replicate our work, or improve upon it, or come up with some new use for the data. Maybe someday soon we will be using Wikipedia data to estimate the prevalence of diabetes in a population, or of a certain type of cancer, or to predict which city might be next to have an outbreak of dengue fever.
