“No Data is Better Than Bad Data…” Really?

I recently tweeted the following:

“No data is better than bad data…” really? if you have no data, how do you know it’s bad data? doh.

This prompted a surprising number of DMs, follow-up emails and even two in-person conversations. Everyone wholeheartedly agreed with my tweet, which was a delayed reaction to a response I got from a journalist at The Economist who tweeted, in a rather derisive tone, that “no data is better than bad data.” This is of course not the first time I’ve heard this statement, so let’s explore the issue further.

The first point to note is the rather contradictory nature of the statement “no data is better than bad data.” Indeed, you have to have data in order to deem it as bad in the first place. But Mr. Economist and company clearly overlook this little detail. Having “bad” data requires that this data be bad relative to other data and thus having said other data in the first place. So if data point A is bad compared to data point B, then by definition data point B is available and good data relative to A. I’m not convinced that a data point is either “good or bad” a priori unless the methods that produce that data are well understood and can themselves be judged. Of course, validating methods requires the comparison of data as well.

In any case, the problem is not bad versus good data, in my opinion. The question has to do with error margins. Shared data seldom comes with associated error margins or any indication of how reliable it is. This rightly leads to questions over data quality. I believe that introducing a simple Likert scale to tag the perceived quality of the data can go a long way. This is what we did back in 2003/2004 when I was on the team that launched the Conflict Early Warning and Response Network (CEWARN) in the Horn of Africa. While I still wonder whether the project has had any real impact on conflict prevention since it launched in 2004, I believe that the initiative’s approach to information collection was pioneering at the time.

The screenshot below is of CEWARN’s online Incident Report Form. Note the “Information Source” and “Information Credibility” fields. These were really informative for us when aggregating the data and studying the corresponding time series. They allowed us to at least gain a certain level of understanding regarding the possible reliability of depicted trends over time. Indeed, we could start quantifying the level of uncertainty or margin of error. Interestingly, this also allowed us to look for patterns in varying credibility scores. Of course, these were perhaps largely based on perceptions but I believe this extra bit of information is worth having if the alternative is no qualifications on the possible credibility of individual reports.
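To make the idea concrete, here is a minimal sketch in Python of how credibility tags might feed into a time series. It is not CEWARN’s actual method: the 1-5 scale, the linear weighting, the function names and the sample reports are all assumptions made up for illustration.

```python
from collections import defaultdict


def credibility_weight(score):
    """Map a hypothetical 1-5 Likert credibility score to a weight in [0, 1].

    The linear mapping is an assumption for illustration, not a documented method.
    """
    if not 1 <= score <= 5:
        raise ValueError("credibility score must be between 1 and 5")
    return (score - 1) / 4


def summarize(reports):
    """Group (period, score) pairs by period and summarize each period.

    Returns, per period, the raw report count, the credibility-weighted
    count and the mean credibility score.
    """
    grouped = defaultdict(list)
    for period, score in reports:
        grouped[period].append(score)

    summary = {}
    for period, scores in sorted(grouped.items()):
        summary[period] = {
            "raw_count": len(scores),
            "weighted_count": sum(credibility_weight(s) for s in scores),
            "mean_credibility": sum(scores) / len(scores),
        }
    return summary


# Made-up example data: (reporting period, credibility score)
reports = [
    ("2004-01", 5),
    ("2004-01", 3),
    ("2004-02", 1),
    ("2004-02", 2),
    ("2004-02", 5),
]

for period, stats in summarize(reports).items():
    print(period, stats)
```

In this made-up sample the second period has more reports but a lower weighted count, which is the kind of gap between a raw trend and a credibility-adjusted trend that such tags can make visible.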

Fast forward to 2011 and you see the same approach taken with the Ushahidi platform. The screenshot below is of the Matrix plugin for Ushahidi developed in partnership with ICT4Peace. The plugin allows reporters to tag reports with the reliability of the source and the probability that the information is correct. The result is the following graphic representing the trustworthiness of the report.

Some closing thoughts: many public health experts that I have spoken to in the field of emergency medicine repeatedly state they would rather have some data that is not immediately verifiable than no data at all. Indeed, in some ways all data begins life this way. They would rather have a potential rumor about a disease outbreak on their radar which they can follow up on and verify than have nothing appear on their radar until it’s too late if said rumor turns out to be true.

Finally, as noted in my previous post on “Tweetsourcing”, while some fear that bad data can cost lives, this doesn’t mean that no data doesn’t cost lives, especially in a crisis zone. Indeed, time is the most perishable commodity during a disaster: the “sell by” date of information is calculated in hours rather than days. This in no way implies that I’m an advocate for bad data! The risks of basing decisions on bad data are obvious. At the end of the day, the question is about tolerance for uncertainty: different disciplines will have varying levels of tolerance depending on the situation, time and place. In sum, making the sweeping statement “no data is better than bad data” can come across as rather myopic.

7 responses to ““No Data is Better Than Bad Data…” Really?”

  1. Thanks for this post, Patrick. The Global Digital Activism Data Set also has to worry about the quality and breadth of data that we are collecting about digital activism cases around the world. Is a source accurate? Is it complete? We’d rather have an incomplete source than an inaccurate one (ie, errors of omission preferred to errors of commission or misreporting).

    Maybe we should be evaluating the “bad data” more carefully. Identifying how data is “bad” might help us determine whether or not it is useful in the absence of data that is “good.”

  2. David Saunders

    As someone who has led teams managing mostly recovery activities or supporting government in transition between the chaos and relief and the more consistent work in development, I find it most common that we work with a changing lens of information. Uncertainty is very high in the early days; as data quality improves, feedback introduces new uncertainty in more detailed areas. Then of course the planning horizon, whether 3 months, 12 months or 5 years, is always based on a large degree of assumptions, with facts later clarifying reality and forcing adjustments, which are often painful due to poor planning and mitigation.

    Working with inaccurate data is a reality much of the time in humanitarian and development work and probably for others too, of course there are many ways to reduce error or mitigate risk. Often selective type and volume of bad data is key. Perhaps the statement should read “selective bad data is better than only bad data”.

  3. I’m so glad that there are others out there that share my passion for research and data and the necessity of both in programming. Tell me, what are your thoughts on proxies? Do these count as bad data or just another risk to consider on a log frame?


    • Thanks for your comment, Julie. I suppose proxies can be bad data, as any data can be. But they’re better than nothing. I suppose the trick is to really understand as well as possible the imagined correlation between the proxy and the actual variable we’re interested in.

  4. I think you have a point Patrick. In crisis management getting accurate data is one of the biggest challenges we face but in my experience families would rather be told what we know – even if we’re not completely sure – than be told nothing. And we would rather work from ‘some’ picture of what the situation is with British Nationals than no picture. We err on the side of caution before going public in the media – we need to be sure for that – but to work with we take what we can get!

    • Many thanks for reading and sharing, Juliet, your professional feedback on this is very validating! I have just added the following sentence to the last paragraph (which I have used in presentations/talks in the past):

      “Indeed, time is the most perishable commodity during a disaster–the “sell by” date of information is calculated in hours rather than days.”

      As a humanitarian colleague of mine in Geneva is fond of saying: “We don’t need to be absolutely correct, just approximately right and not completely wrong.”

      Thanks again!

  5. Pingback: Big Data for Disaster Response: A List of Wrong Assumptions | iRevolution
