I recently tweeted the following:
“No data is better than bad data…” really? if you have no data, how do you know it’s bad data? doh.
This prompted a surprising number of DMs, follow-up emails and even two in-person conversations. Everyone wholeheartedly agreed with my tweet, which was a delayed reaction to a response I got from a journalist at The Economist, who tweeted in a rather derisive tone that “no data is better than bad data.” This is of course not the first time I’ve heard this statement, so let’s explore the issue further.
The first point to note is the rather contradictory nature of the statement “no data is better than bad data.” Indeed, you have to have data in order to deem it bad in the first place. But Mr. Economist and company clearly overlook this little detail. Calling data “bad” means it is bad relative to other data, which requires having that other data in the first place. So if data point A is bad compared to data point B, then by definition data point B is available and is good data relative to A. I’m not convinced that a data point is either “good” or “bad” a priori unless the methods that produce that data are well understood and can themselves be judged. Of course, validating methods requires the comparison of data as well.
In any case, the problem is not bad versus good data, in my opinion. The question has to do with error margins. The vast majority of shared data comes without associated error margins or any indication of its reliability. This rightly leads to questions over data quality. I believe that introducing a simple Likert scale to tag the perceived quality of the data can go a long way. This is what we did back in 2003/2004 when I was on the team that launched the Conflict Early Warning and Response Network (CEWARN) in the Horn of Africa. While I still wonder whether the project has had any real impact on conflict prevention since it launched in 2004, I believe that the initiative’s approach to information collection was pioneering at the time.
The screenshot below is of CEWARN’s online Incident Report Form. Note the “Information Source” and “Information Credibility” fields. These were really informative for us when aggregating the data and studying the corresponding time series. They allowed us to at least gain a certain level of understanding regarding the possible reliability of depicted trends over time. Indeed, we could start quantifying the level of uncertainty or margin of error. Interestingly, this also allowed us to look for patterns in varying credibility scores. Of course, these were perhaps largely based on perceptions but I believe this extra bit of information is worth having if the alternative is no qualifications on the possible credibility of individual reports.
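To make the idea concrete, here is a minimal sketch of how credibility tags might feed into a time series. It is not CEWARN’s actual code: the `Report` class, the 1-to-5 scale and the linear weighting (`score / scale_max`) are all assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Report:
    period: str       # reporting period, e.g. "2004-03" (hypothetical format)
    credibility: int  # hypothetical Likert rating: 1 (low) to 5 (high)


def summarize(reports, scale_max=5):
    """Group reports by period and summarize their credibility.

    Assumption: a report rated `s` counts as s / scale_max of a
    fully credible report. This weighting is illustrative only.
    """
    by_period = defaultdict(list)
    for report in reports:
        if not 1 <= report.credibility <= scale_max:
            raise ValueError(f"credibility out of range: {report.credibility}")
        by_period[report.period].append(report.credibility)

    summary = {}
    for period, scores in sorted(by_period.items()):
        summary[period] = {
            "count": len(scores),                                   # raw number of reports
            "weighted_count": sum(s / scale_max for s in scores),   # discounted by credibility
            "mean_credibility": sum(scores) / len(scores),          # average rating
        }
    return summary


if __name__ == "__main__":
    sample = [Report("2004-01", 5), Report("2004-01", 1), Report("2004-02", 3)]
    for period, stats in summarize(sample).items():
        print(period, stats)
```

A large gap between the raw count and the weighted count for a given period is one rough signal that the trend for that period rests on less credible reports.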
Fast forward to 2011 and you see the same approach taken with the Ushahidi platform. The screenshot below is of the Matrix plugin for Ushahidi developed in partnership with ICT4Peace. The plugin allows reporters to tag reports with the reliability of the source and the probability that the information is correct. The result is the following graphic representing the trustworthiness of the report.
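As a rough illustration of how two such tags could be combined into a single trust indicator, here is a minimal sketch. This is not the Matrix plugin’s actual logic: the function names, the 1-to-5 scales, the multiplication rule and the label thresholds are all hypothetical choices of mine.

```python
def trust_score(source_reliability, info_probability, scale_max=5):
    """Combine two ordinal ratings into a score between 0 and 1.

    Assumption: both ratings run from 1 (low) to scale_max (high) and
    are simply multiplied, so a weak rating on either axis pulls the
    overall score down.
    """
    for value in (source_reliability, info_probability):
        if not 1 <= value <= scale_max:
            raise ValueError(f"rating out of range: {value}")
    return (source_reliability * info_probability) / (scale_max ** 2)


def trust_label(score):
    """Map a score to a coarse label; the thresholds are illustrative only."""
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"


if __name__ == "__main__":
    for reliability, probability in [(5, 5), (4, 4), (3, 3), (5, 1)]:
        score = trust_score(reliability, probability)
        print(reliability, probability, round(score, 2), trust_label(score))
```

Multiplying the two ratings means that a highly reliable source passing along an improbable claim still ends up with a low score, which seems a sensible default when the goal is to flag reports for follow-up rather than to discard them.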
Some closing thoughts: many public health experts I have spoken to in the field of emergency medicine repeatedly state that they would rather have some data that is not immediately verifiable than no data at all. Indeed, in some ways all data begins life this way. They would rather have a potential rumor about a disease outbreak on their radar, which they can follow up on and verify, than have nothing appear on their radar until it’s too late, should the rumor turn out to be true.
Finally, as noted in my previous post on “Tweetsourcing”, while some fear that bad data can cost lives, this doesn’t mean that having no data doesn’t cost lives, especially in a crisis zone. Indeed, time is the most perishable commodity during a disaster: the “sell by” date of information is calculated in hours rather than days. This in no way implies that I’m an advocate for bad data! The risks of basing decisions on bad data are obvious. At the end of the day, the question is about tolerance for uncertainty, and different disciplines will have varying levels of tolerance depending on the situation, time and place. In sum, making the sweeping statement “no data is better than bad data” can come across as rather myopic.