Big Data for Development: Challenges and Opportunities

The UN Global Pulse report on Big Data for Development ought to be required reading for anyone interested in humanitarian applications of Big Data. The purpose of this post is not to summarize this excellent 50-page document but to relay the most important insights contained therein. In addition, I question the motivation behind the unbalanced commentary on Haiti, which is my only major criticism of this otherwise authoritative report.

Real-time “does not always mean occurring immediately. Rather, ‘real-time’ can be understood as information which is produced and made available in a relatively short and relevant period of time, and information which is made available within a timeframe that allows action to be taken in response, i.e., creating a feedback loop. Importantly, it is the intrinsic time dimensionality of the data, and that of the feedback loop, that jointly define its characteristic as real-time. (One could also add that the real-time nature of the data is ultimately contingent on the analysis being conducted in real-time, and by extension, where action is required, used in real-time).”

Data privacy “is the most sensitive issue, with conceptual, legal, and technological implications.” To be sure, “because privacy is a pillar of democracy, we must remain alert to the possibility that it might be compromised by the rise of new technologies, and put in place all necessary safeguards.” Privacy is defined by the International Telecommunication Union as the “right of individuals to control or influence what information related to them may be disclosed.” Moving forward, “these concerns must nurture and shape on-going debates around data privacy in the digital age in a constructive manner in order to devise strong principles and strict rules—backed by adequate tools and systems—to ensure ‘privacy-preserving analysis.’”

Non-representative data is often dismissed outright since findings based on such data cannot be generalized beyond the sample. “But while findings based on non-representative datasets need to be treated with caution, they are not valueless […].” Indeed, while “sampling selection bias can clearly be a challenge, especially in regions or communities where technological penetration is low […], this does not mean that the data has no value.” For one, data from “non-representative” samples (such as mobile phone users) “provide representative information about the sample itself—and do so in close to real time and on a potentially large and growing scale, such that the challenge will become less and less salient as technology spreads across and within developing countries.”
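To see why, consider a minimal simulation (entirely hypothetical numbers, not drawn from the report): sampling only mobile phone owners skews any population-wide estimate, yet it accurately characterizes the phone-owning subgroup itself.

```python
import random

random.seed(42)

def make_person():
    # Hypothetical: 40% of people own phones; phone owners are assumed
    # (purely for illustration) to be less likely to be "in need".
    has_phone = random.random() < 0.4
    in_need = random.random() < (0.2 if has_phone else 0.5)
    return {"has_phone": has_phone, "in_need": in_need}

population = [make_person() for _ in range(10_000)]
phone_owners = [p for p in population if p["has_phone"]]

def share_in_need(people):
    return sum(p["in_need"] for p in people) / len(people)

print(f"True population share in need:    {share_in_need(population):.2f}")
print(f"Estimate from phone-owner sample: {share_in_need(phone_owners):.2f}")
# The phone-owner estimate is biased for the population as a whole
# (~0.20 vs ~0.38), yet it is an accurate estimate of phone owners
# themselves, often exactly the group a mobile-based service reaches.
```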

Perceptions, rather than reality, are what social media captures. Moreover, these perceptions can also be wrong. But only those individuals “who wrongfully assume that the data is an accurate picture of reality can be deceived. Furthermore, there are instances where wrong perceptions are precisely what is desirable to monitor because they might determine collective behaviors in ways that can have catastrophic effects.” In other words, “perceptions can also shape reality. Detecting and understanding perceptions quickly can help change outcomes.”

False data and hoaxes are part and parcel of user-generated content. While the challenges around reliability and verifiability are real, some media organizations, such as the BBC, stand by the utility of citizen reporting of current events: “there are many brave people out there, and some of them are prolific bloggers and Tweeters. We should not ignore the real ones because we were fooled by a fake one.” These organizations “have thus devised internal strategies to confirm the veracity of the information they receive and choose to report, offering an example of what can be done to mitigate the challenge of false information.” See for example my 20-page study on how to verify crowdsourced social media data, a field I refer to as information forensics. In any event, “whether false negatives are more or less problematic than false positives depends on what is being monitored, and why it is being monitored.”

“The United States Geological Survey (USGS) has developed a system that monitors Twitter for significant spikes in the volume of messages about earthquakes,” and, as it turns out, 90% of the user-generated reports that trigger an alert prove to be valid. “Similarly, a recent retrospective analysis of the 2010 cholera outbreak in Haiti conducted by researchers at Harvard Medical School and Children’s Hospital Boston demonstrated that mining Twitter and online news reports could have provided health officials a highly accurate indication of the actual spread of the disease with two weeks lead time.”
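To make the volume-spike idea concrete, here is a minimal sketch of this kind of detector (my own illustration, not the USGS implementation): flag any interval in which the count of earthquake-related messages exceeds a rolling baseline by several standard deviations.

```python
from collections import deque

def detect_spikes(counts, window=60, threshold=4.0):
    """Flag intervals where the message count exceeds the rolling
    baseline by `threshold` standard deviations.

    counts: per-interval counts of earthquake-related messages.
    Returns a list of (index, count) pairs that look like spikes.
    """
    history = deque(maxlen=window)
    spikes = []
    for i, count in enumerate(counts):
        if len(history) == window:
            mean = sum(history) / window
            std = (sum((x - mean) ** 2 for x in history) / window) ** 0.5
            if std > 0 and (count - mean) / std > threshold:
                spikes.append((i, count))
        history.append(count)
    return spikes

# Example: a quiet baseline followed by a sudden burst of messages.
quiet = [3, 4, 2, 5, 3, 4] * 20
print(detect_spikes(quiet + [60, 85, 90]))  # flags the last three intervals
```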

The cholera study leads to the other Haiti example raised in the report, namely the finding that SMS data was correlated with building damage. Please see my previous blog posts here and here for context. What the authors seem to overlook is that Benetech apparently did not submit their counter-findings for independent peer review, whereas the team at the European Commission’s Joint Research Centre did, and the latter passed the peer-review process. Peer review is how rigorous scientific work is validated. The fact that Benetech never submitted their blog post for peer review is actually quite telling.

In sum, while this Big Data report is otherwise strong and balanced, I am really surprised that the authors cite a blog post as “evidence” while completely ignoring the JRC’s peer-reviewed scientific paper published in a journal of the European Geosciences Union. Until counter-findings are submitted for peer review, the JRC’s results stand: unverified, non-representative crowdsourced text messages from the disaster-affected population in Port-au-Prince, which were translated from Haitian Creole to English via a novel crowdsourced volunteer effort and subsequently geo-referenced by hundreds of volunteers without any quality control, produced a statistically significant, positive correlation with building damage.
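For readers unfamiliar with what a “statistically significant, positive correlation” means operationally, here is a minimal sketch, using made-up numbers rather than the JRC’s data, of testing whether per-district SMS volume correlates with assessed building damage:

```python
# Minimal sketch of the kind of test at issue. The numbers below are
# illustrative placeholders, not the JRC's actual data.
from scipy.stats import pearsonr

# Hypothetical per-district observations.
sms_counts = [12, 45, 3, 67, 23, 89, 5, 34, 56, 18]    # SMS reports per district
damage = [0.10, 0.42, 0.05, 0.61, 0.22, 0.70,          # assessed fraction of
          0.09, 0.31, 0.48, 0.16]                      # buildings damaged

r, p_value = pearsonr(sms_counts, damage)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) means the observed association would be
# unlikely under a null hypothesis of no correlation. Note that this
# alone cannot rule out confounders, which is the crux of the
# spurious-correlation debate discussed above.
```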

In conclusion, “any challenge with utilizing Big Data sources of information cannot be assessed divorced from the intended use of the information. These new, digital data sources may not be the best suited to conduct airtight scientific analysis, but they have a huge potential for a whole range of other applications that can greatly affect development outcomes.”

One such application is disaster response. Earlier this year, FEMA Administrator Craig Fugate gave a superb presentation on “Real Time Awareness” in which he relayed an example of how he and his team used Big Data (Twitter) during a series of devastating tornadoes in 2011:

“Mr. Fugate proposed dispatching relief supplies to the long list of locations immediately and received pushback from his team, who were concerned that they did not yet have an accurate estimate of the level of damage. His challenge was to get the staff to understand that the priority should be one of changing outcomes: even if half of the supplies dispatched were never used and had to be sent back later, there would be no chance of reaching communities that were in fact already suffering tornado damage without getting trucks out immediately. He explained, “if you’re waiting to react to the aftermath of an event until you have a formal assessment, you’re going to lose 12-to-24 hours…Perhaps we shouldn’t be waiting for that. Perhaps we should make the assumption that if something bad happens, it’s bad. Speed in response is the most perishable commodity you have…We looked at social media as the public telling us enough information to suggest this was worse than we thought and to make decisions to spend [taxpayer] money to get moving without waiting for formal request, without waiting for assessments, without waiting to know how bad because we needed to change that outcome.”

Fugate also emphasized that using social media as an information source isn’t a precise science and the response isn’t going to be precise either: “Disasters are like horseshoes, hand grenades and thermonuclear devices, you just need to be close—preferably more than less.”

5 responses to “Big Data for Development: Challenges and Opportunities”

  1. Thanks Patrick,
    I’m the author of the report and Robert K. just pointed me to your comment about the Haiti example. Let me just say, before looking deeper into the issue, that the report did not intend to endorse but rather to present/explain Benetech’s criticism, as I thought it contained interesting analytical/econometric insights with broader relevance. The points you are making here do raise a number of questions that I may have missed about the robustness of this criticism, and I will certainly look into it. We will get back to you soon.
    Thanks a lot.
    Emmanuel

    • Hi Emmanuel, many thanks for your comment and for writing such an awesome report. You’ve really distilled some conceptually challenging issues in such an elegant way, thus making it easier for the rest of us to advocate for Big Data applications in the development and humanitarian space. So a big thank you! I agree that Jim’s critique is a useful way to illustrate the issue of spurious correlations. I just have my doubts that the Haiti study qualifies as a valid example. But I could certainly be wrong, which is why I prefer to cite peer-reviewed papers when on uncertain territory; econometrics is certainly not my strength.

      Thanks very much for writing such an excellent report. Looking forward to the next one! 🙂

  2. Hi Patrick,

    Thanks again for your review of the paper.

    As promised, I am following up on our discussion about what you have described as an “unbalanced commentary” on the debate that you (and colleagues at Ushahidi such as Erik Hersman) and Jim Fruchterman (and colleagues at Benetech) engaged in last year.

    As you noted, one of the main objectives of Global Pulse’s report was to provide an overview of the key challenges with using big data for development to a relatively wide—read: not necessarily technical—audience, which did not allow us to go into all the specifics of some arguments and examples. Given your involvement in and contribution to the field, I realize that the account of the debate in the report may seem truncated—as it indeed is for the reasons explained above—which is why I thought a fuller response was warranted.

    Let me take this opportunity to make two series of brief comments.

    First, as I wrote in my initial comment above, the report did not intend to take sides on the issue; specifically, it did not characterize Benetech’s contribution as “evidence”. It did, however, consider their overall approach solid enough to report their counter-findings as an illustration of the concept and risk of spurious correlations in crowdsourced data (and any other type of data, for that matter), even if, as you have pointed out, these findings had not been submitted for peer review. While the peer-review process constitutes the gold standard of research, and will continue to do so for the foreseeable future (but will it forever? that probably warrants a separate discussion), I don’t think that all non-peer-reviewed claims should be discarded on that basis alone; it all depends on the use that is made of them. In this case, beyond statistical finessing about various specifications, the key argument is a powerful and important one, which you are evidently well aware of: in the aftermath of a disaster, the very consequences of the disaster (e.g. higher mortality in the hardest-hit areas) require being cautious about using crowdsourced data for policy purposes.

    Second, as hinted in that last sentence, looking more deeply into the entire history of the case (i.e. the whole trail of online interactions, comments, etc.) and your own writings on the subject (especially on this blog), it appears to me that you and “the good people at Benetech” actually agree on a lot of important points (as Jeff Klingner pointed out when commenting on a post by Erik Hersman on the subject), which should not, and does not, come as a surprise. In particular, the notion of “allsourcing” that you put forth a few years ago shares strikingly similar features with the “successful approach to crowdsourcing data” taken by a “sophisticated user” that Benetech advocated more recently. As we all know, what is needed is to keep building awareness, intent, and capacity to make sense of big data (including crowdsourced data, bounded and unbounded) in order to maximize its potential while limiting the risks associated with its (potentially) simplistic use. And I think having these kinds of debates on the value of speed vs. accuracy, practical vs. academic considerations, etc. is what will get us closer to that objective.

    Thanks!

    Emmanuel

  3. Pingback: Big Data for Development: From Information to Knowledge Societies? | iRevolution

  4. Pingback: Digging Down to the Micro-Foundations | Dart-Throwing Chimp
