Tag Archives: data

How to Create Resilience Through Big Data

Revised! I have edited this article several dozen times since posting the initial draft. I have also made a number of substantial changes to the flow of the article after discovering new connections, synergies and insights. In addition, I  have greatly benefited from reader feedback as well as the very rich conversa-tions that took place during the PopTech & Rockefeller workshop—a warm thank you to all participants for their important questions and feedback!

Introduction

I’ve been invited by PopTech and the Rockefeller Foundation to give the opening remarks at an upcoming event on interdisciplinary dimensions of resilience, which is  being hosted at Georgetown University. This event is connected to their new program focus on “Creating Resilience Through Big Data.” I’m absolutely de-lighted to be involved and am very much looking forward to the conversations. The purpose of this blog post is to summarize the presentation I intend to give and to solicit feedback from readers. So please feel free to use the comments section below to share your thoughts. My focus is primarily on disaster resilience. Why? Because understanding how to bolster resilience to extreme events will provide insights on how to also manage less extreme events, while the converse may not be true.

Big Data Resilience

terminology

One of the guiding questions for the meeting is this: “How do you understand resilience conceptually at present?” First, discourse matters.  The term resilience is important because it focuses not on us, the development and disaster response community, but rather on local at-risk communities. While “vulnerability” and “fragility” were used in past discourse, these terms focus on the negative and seem to invoke the need for external protection, overlooking the fact that many local coping mechanisms do exist. From the perspective of this top-down approach, international organizations are the rescuers and aid does not arrive until these institutions mobilize.

In contrast, the term resilience suggests radical self-sufficiency, and self-sufficiency implies a degree of autonomy; self-dependence rather than depen-dence on an external entity that may or may not arrive, that may or may not be effective, and that may or may not stay the course. The term “antifragile” just recently introduced by Nassim Taleb also appeals to me. Antifragile sys-tems thrive on disruption. But lets stick with the term resilience as anti-fragility will be the subject of a future blog post, i.e., I first need to finish reading Nassim’s book! I personally subscribe to the following definition of resilience: the capacity for self-organization; and shall expand on this shortly.

(See the Epilogue at the end of this blog post on political versus technical defini-tions of resilience and the role of the so-called “expert”. And keep in mind that poverty, cancer, terrorism etc., are also resilient systems. Hint: we have much to learn from pernicious resilience and the organizational & collective action models that render those systems so resilient. In their book on resilience, Andrew Zolli and Ann Marie Healy note the strong similarities between Al-Qaeda & tuber-culosis, one of which are the two systems’ ability to regulate their metabolism).

Hazards vs Disasters

In the meantime, I first began to study the notion of resilience from the context of complex systems and in particular the field of ecology, which defines resilience as “the capacity of an ecosystem to respond to a perturbation or disturbance by resisting damage and recovering quickly.” Now lets unpack this notion of perturbation. There is a subtle but fundamental difference between disasters (processes) and hazards (events); a distinction that Jean-Jacques Rousseau first articulated in 1755 when Portugal was shaken by an earthquake. In a letter to Voltaire one year later, Rousseau notes that, “nature had not built [process] the houses which collapsed and suggested that Lisbon’s high population density [process] contributed to the toll” (1). In other words, natural events are hazards and exogenous while disas-ters are the result of endogenous social processes. As Rousseau added in his note to Voltaire, “an earthquake occurring in wilderness would not be important to society” (2). That is, a hazard need not turn to disaster since the latter is strictly a product or calculus of social processes (structural violence).

And so, while disasters were traditionally perceived as “sudden and short lived events, there is now a tendency to look upon disasters in African countries in particular, as continuous processes of gradual deterioration and growing vulnerability,” which has important “implications on the way the response to disasters ought to be made” (3). (Strictly speaking, the technical difference between events and processes is one of scale, both temporal and spatial, but that need not distract us here). This shift towards disasters as processes is particularly profound for the creation of resilience, not least through Big Data. To under-stand why requires a basic introduction to complex systems.

complex systems

All complex systems tend to veer towards critical change. This is explained by the process of Self-Organized Criticality (SEO). Over time, non-equilibrium systems with extended degrees of freedom and a high level of nonlinearity become in-creasingly vulnerable to collapse. Social, economic and political systems certainly qualify as complex systems. As my “alma mater” the Santa Fe Institute (SFI) notes, “The archetype of a self-organized critical system is a sand pile. Sand is slowly dropped onto a surface, forming a pile. As the pile grows, avalanches occur which carry sand from the top to the bottom of the pile” (4). That is, the sand pile becomes increasingly unstable over time.

Consider an hourglass or sand clock as an illustration of self-organized criticality. Grains of sand sifting through the narrowest point of the hourglass represent individual events or natural hazards. Over time a sand pile starts to form. How this process unfolds depends on how society chooses to manage risk. A laisser-faire attitude will result in a steeper pile. And grain of sand falling on an in-creasingly steeper pile will eventually trigger an avalanche. Disaster ensues.

Why does the avalanche occur? One might ascribe the cause of the avalanche to that one grain of sand, i.e., a single event. On the other hand, a complex systems approach to resilience would associate the avalanche with the pile’s increasing slope, a historical process which renders the structure increasingly vulnerable to falling grains. From this perspective, “all disasters are slow onset when realisti-cally and locally related to conditions of susceptibility”. A hazard event might be rapid-onset, but the disaster, requiring much more than a hazard, is a long-term process, not a one-off event. The resilience of a given system is therefore not simply dependent on the outcome of future events. Resilience is the complex product of past social, political, economic and even cultural processes.

dealing with avalanches

Scholars like Thomas Homer-Dixon argue that we are becoming increasingly prone to domino effects or cascading changes across systems, thus increasing the likelihood of total synchronous failure. “A long view of human history reveals not regular change but spasmodic, catastrophic disruptions followed by long periods of reinvention and development.” We must therefore “reduce as much as we can the force of the underlying tectonic stresses in order to lower the risk of synchro-nous failure—that is, of catastrophic collapse that cascades across boundaries between technological, social and ecological systems” (5).

Unlike the clock’s lifeless grains of sand, human beings can adapt and maximize their resilience to exogenous shocks through disaster preparedness, mitigation and adaptation—which all require political will. As a colleague of mine recently noted, “I wish it were widely spread amongst society  how important being a grain of sand can be.” Individuals can “flatten” the structure of the sand pile into a less hierarchical but more resilience system, thereby distributing and diffusing the risk and size of an avalanche. Call it distributed adaptation.

operationalizing resilience

As already, the field of ecology defines  resilience as “the capacity of an ecosystem to respond to a perturbation or disturbance by resisting damage and recovering quickly.” Using this understanding of resilience, there are at least 2 ways create more resilient “social ecosystems”:

  1. Resist damage by absorbing and dampening the perturbation.
  2. Recover quickly by bouncing back or rather forward.

Resisting Damage

So how does a society resist damage from a disaster? As hinted earlier, there is no such thing as a “natural” disaster. There are natural hazards and there are social systems. If social systems are not sufficiently resilient to absorb the impact of a natural hazard such as an earthquake, then disaster unfolds. In other words, hazards are exogenous while disasters are the result of endogenous political, economic, social and cultural processes. Indeed, “it is generally accepted among environmental geographers that there is no such thing as a natural disaster. In every phase and aspect of a disaster—causes, vulnerability, preparedness, results and response, and reconstruction—the contours of disaster and the difference between who lives and dies is to a greater or lesser extent a social calculus” (6).

So how do we apply this understanding of disasters and build more resilient communities? Focusing on people-centered early warning systems is one way to do this. In 2006, the UN’s International Strategy for Disaster Reduction (ISDR) recognized that top-down early warning systems for disaster response were increasingly ineffective. They thus called for a more bottom-up approach in the form of people-centered early warning systems. The UN ISDR’s Global Survey of Early Warning Systems (PDF), defines the purpose of people-centered early warning systems as follows:

“… to empower individuals and communities threatened by hazards to act in sufficient time and in an appropriate manner so as to reduce the possibility of personal injury, loss of life, damage to property and the environment, and loss of livelihoods.”

Information plays a central role here. Acting in sufficient time requires having timely information about (1) the hazard/s, (2) our resilience and (3) how to respond. This is where information and communication technologies (ICTs), social media and Big Data play an important role. Take the latter, for example. One reason for the considerable interest in Big Data is prediction and anomaly detection. Weather and climatic sensors provide meteorologists with the copious amounts of data necessary for the timely prediction of weather patterns and  early detection of atmospheric hazards. In other words, Big Data Analytics can be used to anticipate the falling grains of sand.

Now, predictions are often not correct. But the analysis of Big Data can also help us characterize the sand pile itself, i.e., our resilience, along with the associated trends towards self-organized criticality. Recall that complex systems tend towards instability over time (think of the hourglass above). Thanks to ICTs, social media and Big Data, we now have the opportunity to better characterize in real-time the social, economic and political processes driving our sand pile. Now, this doesn’t mean that we have a perfect picture of the road to collapse; simply that our picture is clearer than ever before in human history. In other words, we can better measure our own resilience. Think of it as the Quantified Self move-ment applied to an entirely different scale, that of societies and cities. The point is that Big Data can provide us with more real-time feedback loops than ever before. And as scholars of complex systems know, feedback loops are critical for adaptation and change. Thanks to social media, these loops also include peer-to-peer feedback loops.

An example of monitoring resilience in real-time (and potentially anticipating future changes in resilience) is the UN Global Pulse’s project on food security in Indonesia. They partnered with Crimson Hexagon to forecast food prices in Indonesia by analyzing tweets referring to the price of rice. They found an inter-esting relationship between said tweets and government statistics on food price inflation. Some have described the rise of social media as a new nervous system for the planet, capturing the pulse of our social systems. My colleagues and I at QCRI are therefore in the process of appling this approach to the study of the Arabic Twittersphere. Incidentally, this is yet another critical reason why Open Data is so important (check out the work of OpenDRI, Open Data for Resilience Initiative. See also this post on Demo-cratizing ICT for Development with DIY Innovation and Open Data). More on open data and data philanthropy in the conclusion.

Finally, new technologies can also provide guidance on how to respond. Think of Foursquare but applied to disaster response. Instead of “Break Glass in Case of Emergency,” how about “Check-In in Case of Emergency”? Numerous smart-phone apps such as Waze already provide this kind of at-a-glance, real-time situational awareness. It is only a matter of time until humanitarian organiza-tions develop disaster response apps that will enable disaster-affected commu-nities to check-in for real time guidance on what to do given their current location and level of resilience. Several disaster preparedness apps already exist. Social computing and Big Data Analytics can power these apps in real-time.

Quick Recovery

As already noted, there are at least two ways create more resilient “social eco-systems”. We just discussed the first: resisting damage by absorbing and dam-pening the perturbation.  The second way to grow more resilient societies is by enabling them to rapidly recover following a disaster.

As Manyena writes, “increasing attention is now paid to the capacity of disaster-affected communities to ‘bounce back’ or to recover with little or no external assistance following a disaster.” So what factors accelerate recovery in eco-systems in general? In ecological terms, how quickly the damaged part of an ecosystem can repair itself depends on how many feedback loops it has to the non- (or less-) damaged parts of the ecosystem(s). These feedback loops are what enable adaptation and recovery. In social ecosystems, these feedback loops can be comprised of information in addition to the transfer of tangible resources.  As some scholars have argued, a disaster is first of all “a crisis in communicating within a community—that is, a difficulty for someone to get informed and to inform other people” (7).

Improving ways for local communities to communicate internally and externally is thus an important part of building more resilient societies. Indeed, as Homer-Dixon notes, “the part of the system that has been damaged recovers by drawing resources and information from undamaged parts.” Identifying needs following a disaster and matching them to available resources is an important part of the process. Indeed, accelerating the rate of (1) identification; (2) matching and, (3) allocation, are important ways to speed up overall recovery.

This explains why ICTs, social media and Big Data are central to growing more resilient societies. They can accelerate impact evaluations and needs assessments at the local level. Population displacement following disasters poses a serious public health risk. So rapidly identifying these risks can help affected populations recover more quickly. Take the work carried out by my colleagues at Flowminder, for example. They  empirically demonstrated that mobile phone data (Big Data!) can be used to predict population displacement after major disasters. Take also this study which analyzed call dynamics to demonstrate that telecommunications data could be used to rapidly assess the impact of earthquakes. A related study showed similar results when analyzing SMS’s and building damage Haiti after the 2010 earthquake.

haiti_overview_570

Resilience as Self-Organization and Emergence

Connection technologies such as mobile phones allow individual “grains of sand” in our societal “sand pile” to make necessary connections and decisions to self-organize and rapidly recover from disasters. With appropriate incentives, pre-paredness measures and policies, these local decisions can render a complex system more resilient. At the core here is behavior change and thus the importance of understanding behavior change models. Recall  also Thomas Schelling’s observation that micro-motives can lead to macro-behavior. To be sure, as Thomas Homer-Dixon rightly notes, “Resilience is an emergent property of a system—it’s not a result of any one of the system’s parts but of the synergy between all of its parts.  So as a rough and ready rule, boosting the ability of each part to take care of itself in a crisis boosts overall resilience.” (For complexity science readers, the notions of transforma-tion through phase transitions is relevant to this discussion).

In other words, “Resilience is the capacity of the affected community to self-organize, learn from and vigorously recover from adverse situations stronger than it was before” (8). This link between resilience and capacity for self-organization is very important, which explains why a recent and major evaluation of the 2010 Haiti Earthquake disaster response promotes the “attainment of self-sufficiency, rather than the ongoing dependency on standard humanitarian assistance.” Indeed, “focus groups indicated that solutions to help people help themselves were desired.”

The fact of the matter is that we are not all affected in the same way during a disaster. (Recall the distinction between hazards and disasters discussed earlier). Those of use who are less affected almost always want to help those in need. Herein lies the critical role of peer-to-peer feedback loops. To be sure, the speed at which the damaged part of an ecosystem can repair itself depends on how many feedback loops it has to the non- (or less-) damaged parts of the eco-system(s). These feedback loops are what enable adaptation and recovery.

Lastly, disaster response professionals cannot be every where at the same time. But the crowd is always there. Moreover, the vast majority of survivals following major disasters cannot be attributed to external aid. One study estimates that at most 10% of external aid contributes to saving lives. Why? Because the real first responders are the disaster-affected communities themselves, the local popula-tion. That is, the real first feedback loops are always local. This dynamic of mutual-aid facilitated by social media is certainly not new, however. My colleagues in Russia did this back in 2010 during the major forest fires that ravaged their country.

While I do have a bias towards people-centered interventions, this does not mean that I discount the importance of feedback loops to external actors such as traditional institutions and humanitarian organizations. I also don’t mean to romanticize the notion of “indigenous technical knowledge” or local coping mechanism. Some violate my own definition of human rights, for example. However, my bias stems from the fact that I am particularly interested in disaster resilience within the context of areas of limited statehood where said institutions and organizations are either absent are ineffective. But I certainly recognize the importance of scale jumping, particularly within the context of social capital and social media.

RESILIENCE THROUGH SOCIAL CAPITAL

Information-based feedback loops general social capital, and the latter has been shown to improve disaster resilience and recovery. In his recent book entitled “Building Resilience: Social Capital in Post-Disaster Recovery,” Daniel Aldrich draws on both qualitative and quantitative evidence to demonstrate that “social resources, at least as much as material ones, prove to be the foundation for resilience and recovery.” His case studies suggest that social capital is more important for disaster resilience than physical and financial capital, and more important than conventional explanations. So the question that naturally follows given our interest in resilience & technology is this: can social media (which is not restricted by geography) influence social capital?

Social Capital

Building on Daniel’s research and my own direct experience in digital humani-tarian response, I argue that social media does indeed nurture social capital during disasters. “By providing norms, information, and trust, denser social networks can implement a faster recovery.” Such norms also evolve on Twitter, as does information sharing and trust building. Indeed, “social ties can serve as informal insurance, providing victims with information, financial help and physical assistance.” This informal insurance, “or mutual assistance involves friends and neighbors providing each other with information, tools, living space, and other help.” Again, this bonding is not limited to offline dynamics but occurs also within and across online social networks. Recall the sand pile analogy. Social capital facilitates the transformation of the sand pile away (temporarily) from self-organized criticality. On a related note vis-a-vis open source software, “the least important part of open source software is the code.” Indeed, more important than the code is the fact that open source fosters social ties, networks, communities and thus social capital.

(Incidentally, social capital generated during disasters is social capital that can subsequently be used to facilitate self-organization for non-violent civil resistance and vice versa).

RESILIENCE through big data

My empirical research on tweets posted during disasters clearly shows that while many use twitter (and social media more generally) to post needs during a crisis, those who are less affected in the social ecosystem will often post offers to help. So where does Big Data fit into this particular equation? When disaster strikes, access to information is equally important as access to food and water. This link between information, disaster response and aid was officially recognized by the Secretary General of the International Federation of Red Cross & Red Crescent Societies in the World Disasters Report published in 2005. Since then, disaster-affected populations have become increasingly digital thanks to the very rapid and widespread adoption of mobile technologies. Indeed, as a result of these mobile technologies, affected populations are increasingly able to source, share and generate a vast amount of information, which is completely transforming disaster response.

In other words, disaster-affected communities are increasingly becoming the source of Big (Crisis) Data during and following major disasters. There were over 20 million tweets posted during Hurricane Sandy. And when the major earth-quake and Tsunami hit Japan in early 2011, over 5,000 tweets were being posted every secondThat is 1.5 million tweets every 5 minutes. So how can Big Data Analytics create more resilience in this respect? More specifically, how can Big Data Analytics accelerate disaster recovery? Manually monitoring millions of tweets per minute is hardly feasible. This explains why I often “joke” that we need a local Match.com for rapid disaster recovery. Thanks to social computing, artifi-cial intelligence, machine learning and Big Data Analytics, we can absolutely develop a “Match.com” for rapid recovery. In fact, I’m working on just such a project with my colleagues at QCRI. We are also developing algorithms to auto-matically identify informative and actionable information shared on Twitter, for example. (Incidentally, a by-product of developing a robust Match.com for disaster response could very well be an increase in social capital).

There are several other ways that advanced computing can create disaster resilience using Big Data. One major challenge is digital humanitarian response is the verification of crowdsourced, user-generated content. Indeed, misinforma-tion and rumors can be highly damaging. If access to information is tantamount to food access as noted by the Red Cross, then misinformation is like poisoned food. But Big Data Analytics has already shed some light on how to develop potential solutions. As it turns out, non-credible disaster information shared on Twitter propagates differently than credible information, which means that the credibility of tweets could be predicted automatically.

Conclusion

In sum, “resilience is the critical link between disaster and development; monitoring it [in real-time] will ensure that relief efforts are supporting, and not eroding […] community capabilities” (9). While the focus of this blog post has been on disaster resilience, I believe the insights provided are equally informa-tive for less extreme events.  So I’d like to end on two major points. The first has to do with data philanthropy while the second emphasizes the critical importance of failing gracefully.

Big Data is Closed and Centralized

A considerable amount of “Big Data” is Big Closed and Centralized Data. Flow-minder’s study mentioned above draws on highly proprietary telecommunica-tions data. Facebook data, which has immense potential for humanitarian response, is also closed. The same is true of Twitter data, unless you have millions of dollars to pay for access to the full Firehose, or even Decahose. While access to the Twitter API is free, the number of tweets that can be downloaded and analyzed is limited to several thousand a day. Contrast this with the 5,000 tweets per second posted after the earthquake and Tsunami in Japan. We therefore need some serious political will from the corporate sector to engage in “data philanthropy”. Data philanthropy involves companies sharing proprietary datasets for social good. Call it Corporate Social Responsibility (CRS) for digital humanitarian response. More here on how this would work.

Failing Gracefully

Lastly, on failure. As noted, complex systems tend towards instability, i.e., self-organized criticality, which is why Homer-Dixon introduces the notion of failing gracefully. “Somehow we have to find the middle ground between dangerous rigidity and catastrophic collapse.” He adds that:

“In our organizations, social and political systems, and individual lives, we need to create the possibility for what computer programmers and disaster planners call ‘graceful’ failure. When a system fails gracefully, damage is limited, and options for recovery are preserved. Also, the part of the system that has been damaged recovers by drawing resources and information from undamaged parts.” Homer-Dixon explains that “breakdown is something that human social systems must go through to adapt successfully to changing conditions over the long term. But if we want to have any control over our direction in breakdown’s aftermath, we must keep breakdown constrained. Reducing as much as we can the force of underlying tectonic stresses helps, as does making our societies more resilient. We have to do other things too, and advance planning for breakdown is undoubtedly the most important.”

As Louis Pasteur famously noted, “Chance favors the prepared mind.” Preparing for breakdown is not defeatist or passive. Quite on the contrary, it is wise and pro-active. Our hubris—including our current infatuation with Bid Data—all too often clouds our better judgment. Like Macbeth, rarely do we seriously ask our-selves what we would do “if we should fail.” The answer “then we fail” is an option. But are we truly prepared to live with the devastating consequences of total synchronous failure?

In closing, some lingering (less rhetorical) questions:

  • How can resilience can be measured? Is there a lowest common denominator? What is the “atom” of resilience?
  • What are the triggers of resilience, creative capacity, local improvisation, regenerative capacity? Can these be monitored?
  • Where do the concepts of “lived reality” and “positive deviance” enter the conversation on resilience?
  • Is resiliency a right? Do we bear a responsibility to render systems more resilient? If so, recalling that resilience is the capacity to self-organize, do local communities have the right to self-organize? And how does this differ from democratic ideals and freedoms?
  • Recent research in social-psychology has demonstrated that mindfulness is an amplifier of resilience for individuals? How can be scaled up? Do cultures and religions play a role here?
  • Collective memory influences resilience. How can this be leveraged to catalyze more regenerative social systems?

bio

Epilogue: Some colleagues have rightfully pointed out that resilience is ultima-tely political. I certainly share that view, which is why this point came up in recent conversations with my PopTech colleagues Andrew Zolli & Leetha Filderman. Readers of my post will also have noted my emphasis on distinguishing between hazards and disasters; that the latter are the product of social, economic and political processes. As noted in my blog post, there are no natural disastersTo this end, some academics rightly warn that “Resilience is a very technical, neutral, apolitical term. It was initially designed to characterize systems, and it doesn’t address power, equity or agency…  Also, strengthening resilience is not free—you can have some winners and some losers.”

As it turns out, I have a lot say about the political versus technical argument. First of all, this is hardly a new or original argument but nevertheless an important one. Amartya Senn discussed this issue within the context of famines decades ago, noting that famines do not take place in democracies. In 1997, Alex de Waal published his seminal book, “Famine Crimes: Politics and the Disaster Relief In-dustry in Africa.” As he rightly notes, “Fighting famine is both a technical and political challenge.” Unfortunately, “one universal tendency stands out: technical solutions are promoted at the expense of political ones.” There is also a tendency to overlook the politics of technical actions, muddle or cover political actions with technical ones, or worse, to use technical measures as an excuse not to undertake needed political action.

De Waal argues that the use of the term “governance” was “an attempt to avoid making the political critique too explicit, and to enable a focus on specific technical aspects of government.” In some evaluations of development and humanitarian projects, “a caveat is sometimes inserted stating that politics lies beyond the scope of this study.” To this end, “there is often a weak call for ‘political will’ to bridge the gap between knowledge of technical measures and action to implement them.” As de Waal rightly notes, “the problem is not a ‘missing link’ but rather an entire political tradition, one manifestation of which is contemporary international humanitarianism.” In sum, “technical ‘solutions’ must be seen in the political context, and politics itself in the light of the domi-nance of a technocratic approach to problems such as famine.”

From a paper I presented back in 2007: “the technological approach almost always serves those who seek control from a distance.” As a result of this technological drive for pole position, a related “concern exists due to the separation of risk evaluation and risk reduction between science and political decision” so that which is inherently politically complex becomes depoliticized and mechanized. In Toward a Rational Society (1970), the German philosopher Jürgen Habermas describes “the colonization of the public sphere through the use of instrumental technical rationality. In this sphere, complex social problems are reduced to technical questions, effectively removing the plurality of contending perspectives.”

To be sure, Western science tends to pose the question “How?” as opposed to “Why?”What happens then is that “early warning systems tend to be largely conceived as hazard-focused, linear, topdown, expert driven systems, with little or no engagement of end-users or their representatives.” As De Waal rightly notes, “the technical sophistication of early warning systems is offset by a major flaw: response cannot be enforced by the populace. The early warning information is not normally made public.”  In other words, disaster prevention requires “not merely identifying causes and testing policy instruments but building a [social and] political movement” since “the framework for response is inherently political, and the task of advocacy for such response cannot be separated from the analytical tasks of warning.”

Recall my emphasis on people-centered early warning above and the definition of resilience as capacity for self-organization. Self-organization is political. Hence my efforts to promote greater linkages between the fields of nonviolent action and early warning years ago. I have a paper (dated 2008) specifically on this topic should anyone care to read. Anyone who has read my doctoral dissertation will also know that I have long been interested in the impact of technology on the balance of power in political contexts. A relevant summary is available here. Now, why did I not include all this in the main body of my blog post? Because this updated section already runs over 1,000 words.

In closing, I disagree with the over-used criticism that resilience is reactive and about returning to initial conditions. Why would we want to be reactive or return to initial conditions if the latter state contributed to the subsequent disaster we are recovering from? When my colleague Andrew Zolli talks about resilience, he talks about “bouncing forward”, not bouncing back. This is also true of Nassim Taleb’s term antifragility, the ability to thrive on disruption. As Homer-Dixon also notes, preparing to fail gracefully is hardly reactive either.

Comparing the Quality of Crisis Tweets Versus 911 Emergency Calls

In 2010, I published this blog post entitled “Calling 911: What Humanitarians Can Learn from 50 Years of Crowdsourcing.” Since then, humanitarian colleagues have become increasingly open to the use of crowdsourcing as a methodology to  both collect and process information during disasters.  I’ve been studying the use of twitter in crisis situations and have been particularly interested in the quality, actionability and credibility of such tweets. My findings, however, ought to be placed in context and compared to other, more traditional, reporting channels, such as the use of official emergency telephone numbers. Indeed, “Information that is shared over 9-1-1 dispatch is all unverified information” (1).

911ex

So I did some digging and found the following statistics on 911 (US) & 999 (UK) emergency calls:

  • “An astounding 38% of some 10.4 million calls to 911 [in New York City] during 2010 involved such accidental or false alarm ‘short calls’ of 19 seconds or less — that’s an average of 10,700 false calls a day”.  – Daily News
  • “Last year, seven and a half million emergency calls were made to the police in Britain. But fewer than a quarter of them turned out to be real emergencies, and many were pranks or fakes. Some were just plain stupid.” – ABC News

I also came across the table below in this official report (PDF) published in 2011 by the European Emergency Number Association (EENA). The Greeks top the chart with a staggering 99% of all emergency calls turning out to be false/hoaxes, while Estonians appear to be holier than the Pope with less than 1% of such calls.

Screen Shot 2012-12-11 at 4.45.34 PM

Point being: despite these “data quality” issues, European law enforcement agencies have not abandoned the use of emergency phone numbers to crowd-source the reporting of emergencies. They are managing the challenge since the benefit of these number still far outweigh the costs. This calculus is unlikely to change as law enforcement agencies shift towards more mobile-based solutions like the use of SMS for 911 in the US. This important shift may explain why tra-ditional emergency response outfits—such as London’s Fire Brigade—are putting in place processes that will enable the public to report via Twitter.

For more information on the verification of crowdsourced social media informa-tion for disaster response, please follow this link.

Using E-Mail Data to Estimate International Migration Rates

As is well known, “estimates of demographic flows are inexistent, outdated, or largely inconsistent, for most countries.” I would add costly to that list as well. So my QCRI colleague Ingmar Weber co-authored a very interesting study on the use of e-mail data to estimate international migration rates.

The study analyzes a large sample of Yahoo! emails sent by 43 million users between September 2009 and June 2011. “For each message, we know the date when it was sent and the geographic location from where it was sent. In addition, we could link the message with the person who sent it, and with the user’s demographic information (date of birth and gender), that was self reported when he or she signed up for a Yahoo! account. We estimated the geographic location from where each email message was sent using the IP address of the user.”

The authors used data on existing migration rates for a dozen countries and international statistics on Internet diffusion rates by age and gender in order to correct for selection bias. For example, “estimated number of migrants, by age group and gender, is multiplied by a correction factor to adjust for over-representation of more educated and mobile people in groups for which the Internet penetration is low.” The graphs below are estimates of age and gender-specific immigration rates for the Philippines. “The gray area represents the size of the bias correction.” This means that “without any correction for bias, the point estimates would be at the upper end of the gray area.” These methods “correct for the fact that the group of users in the sample, although very large, is not representative of the entire population.”

The results? Ingmar and his co-author Emilio Zagheni were able to “estimate migration rates that are consistent with the ones published by those few countries that compile migration statistics. By using the same method for all geographic regions, we obtained country statistics in a consistent way, and we generated new information for those countries that do not have registration systems in place (e.g., developing countries), or that do not collect data on out-migration (e.g., the United States).” Overall, the study documented a “global trend of increasing mobility,” which is “growing at a faster pace for females than males. The rate of increase for different age groups varies across countries.”

The authors argue that this approach could also be used in the context of “natural” disasters and man-made disasters. In terms of future research, they are interested in evaluating “whether sending a high proportion of e-mail messages to a particular country (which is a proxy for having a strong social network in the country) is related to the decision of actually moving to the country.” Naturally, they are also interested in analyzing Twitter data. “In addition to mobility or migration rates, we could evaluate sentiments pro or against migration for different geographic areas. This would help us understand how sentiments change near an international border or in regions with different migration rates and economic conditions.”

I’m very excited to have Ingmar at QCRI so we can explore these ideas further and in the context of humanitarian and development challenges. I’ve been dis-cussing similar research ideas with my colleagues at UN Global Pulse and there may be a real sweet spot for collaboration here, particularly with the recently launched Pulse Lab in Jakarta.” The possibility of collaborating with my collea-gues at Flowminder could also be really interesting given their important study of population movement following the Haiti Earthquake. In conclusion, I fully share the authors’ sentiment when they highlight the fact that it is “more and more important to develop models for data sharing between private com-panies and the academic world, that allow for both protection of users’ privacy & private companies’ interests, as well as reproducibility in scientific publishing.”

Become a (Social Media) Data Donor and Save a Life

I was recently in New York where I met up with my colleague Fernando Diaz from Microsoft Research. We were discussing the uses of social media in humanitarian crises and the various constraints of social media platforms like Twitter vis-a-vis their Terms of Service. And then this occurred to me: we have organ donation initiatives and organ donor cards that many of us carry around in our wallets. So why not become a “Data Donor” as well in the event of an emergency? After all, it has long been recognized that access to information during a crisis is as important as access to food, water, shelter and medical aid.

This would mean having a setting that gives others during a crisis the right (for a limited time) to use your public tweets or Facebook status updates for the ex-pressed purpose of supporting emergency response operations, such as live crisis maps. Perhaps switching this setting on would also come with the provision that the user confirms that s/he will not knowingly spread false or misleading information as part of their data donation. Of course, the other option is to simply continue doing what many have been doing all along, i.e., keep using social media updates for humanitarian response regardless of whether or not they violate the various Terms of Service.

Big Data for Development: Challenges and Opportunities

The UN Global Pulse report on Big Data for Development ought to be required reading for anyone interested in humanitarian applications of Big Data. The purpose of this post is not to summarize this excellent 50-page document but to relay the most important insights contained therein. In addition, I question the motivation behind the unbalanced commentary on Haiti, which is my only major criticism of this otherwise authoritative report.

Real-time “does not always mean occurring immediately. Rather, “real-time” can be understood as information which is produced and made available in a relatively short and relevant period of time, and information which is made available within a timeframe that allows action to be taken in response i.e. creating a feedback loop. Importantly, it is the intrinsic time dimensionality of the data, and that of the feedback loop that jointly define its characteristic as real-time. (One could also add that the real-time nature of the data is ultimately contingent on the analysis being conducted in real-time, and by extension, where action is required, used in real-time).”

Data privacy “is the most sensitive issue, with conceptual, legal, and technological implications.” To be sure, “because privacy is a pillar of democracy, we must remain alert to the possibility that it might be compromised by the rise of new technologies, and put in place all necessary safeguards.” Privacy is defined by the International Telecommunications Union as theright of individuals to control or influence what information related to them may be disclosed.” Moving forward, “these concerns must nurture and shape on-going debates around data privacy in the digital age in a constructive manner in order to devise strong principles and strict rules—backed by adequate tools and systems—to ensure “privacy-preserving analysis.”

Non-representative data is often dismissed outright since findings based on such data cannot be generalized beyond that sample. “But while findings based on non-representative datasets need to be treated with caution, they are not valueless […].” Indeed, while the “sampling selection bias can clearly be a challenge, especially in regions or communities where technological penetration is low […],  this does not mean that the data has no value. For one, data from “non-representative” samples (such as mobile phone users) provide representative information about the sample itself—and do so in close to real time and on a potentially large and growing scale, such that the challenge will become less and less salient as technology spreads across and within developing countries.”

Perceptions rather than reality is what social media captures. Moreover, these perceptions can also be wrong. But only those individuals “who wrongfully assume that the data is an accurate picture of reality can be deceived. Furthermore, there are instances where wrong perceptions are precisely what is desirable to monitor because they might determine collective behaviors in ways that can have catastrophic effects.” In other words, “perceptions can also shape reality. Detecting and understanding perceptions quickly can help change outcomes.”

False data and hoaxes are part and parcel of user-generated content. While the challenges around reliability and verifiability are real, Some media organizations, such as the BBC, stand by the utility of citizen reporting of current events: “there are many brave people out there, and some of them are prolific bloggers and Tweeters. We should not ignore the real ones because we were fooled by a fake one.” And have thus devised internal strategies to confirm the veracity of the information they receive and chose to report, offering an example of what can be done to mitigate the challenge of false information.” See for example my 20-page study on how to verify crowdsourced social media data, a field I refer to as information forensics. In any event, “whether false negatives are more or less problematic than false positives depends on what is being monitored, and why it is being monitored.”

“The United States Geological Survey (USGS) has developed a system that monitors Twitter for significant spikes in the volume of messages about earthquakes,” and as it turns out, 90% of user-generated reports that trigger an alert have turned out to be valid. “Similarly, a recent retrospective analysis of the 2010 cholera outbreak in Haiti conducted by researchers at Harvard Medical School and Children’s Hospital Boston demonstrated that mining Twitter and online news reports could have provided health officials a highly accurate indication of the actual spread of the disease with two weeks lead time.”

This leads to the other Haiti example raised in the report, namely the finding that SMS data was correlated with building damage. Please see my previous blog posts here and here for context. What the authors seem to overlook is that Benetech apparently did not submit their counter-findings for independent peer-review whereas the team at the European Commission’s Joint Research Center did—and the latter passed the peer-review process. Peer-review is how rigorous scientific work is validated. The fact that Benetech never submitted their blog post for peer-review is actually quite telling.

In sum, while this Big Data report is otherwise strong and balanced, I am really surprised that they cite a blog post as “evidence” while completely ignoring the JRC’s peer-reviewed scientific paper published in the Journal of the European Geosciences Union. Until counter-findings are submitted for peer review, the JRC’s results stand: unverified, non-representative crowd-sourced text messages from the disaster affected population in Port-au-Prince that were in turn translated from Haitian Creole to English via a novel crowdsourced volunteer effort and subsequently geo-referenced by hundreds of volunteers  which did not undergo any quality control, produced a statistically significant, positive correlation with building damage.

In conclusion, “any challenge with utilizing Big Data sources of information cannot be assessed divorced from the intended use of the information. These new, digital data sources may not be the best suited to conduct airtight scientific analysis, but they have a huge potential for a whole range of other applications that can greatly affect development outcomes.”

One such application is disaster response. Earlier this year, FEMA Administrator Craig Fugate, gave a superb presentation on “Real Time Awareness” in which he relayed an example of how he and his team used Big Data (twitter) during a series of devastating tornadoes in 2011:

“Mr. Fugate proposed dispatching relief supplies to the long list of locations immediately and received pushback from his team who were concerned that they did not yet have an accurate estimate of the level of damage. His challenge was to get the staff to understand that the priority should be one of changing outcomes, and thus even if half of the supplies dispatched were never used and sent back later, there would be no chance of reaching communities in need if they were in fact suffering tornado damage already, without getting trucks out immediately. He explained, “if you’re waiting to react to the aftermath of an event until you have a formal assessment, you’re going to lose 12-to-24 hours…Perhaps we shouldn’t be waiting for that. Perhaps we should make the assumption that if something bad happens, it’s bad. Speed in response is the most perishable commodity you have…We looked at social media as the public telling us enough information to suggest this was worse than we thought and to make decisions to spend [taxpayer] money to get moving without waiting for formal request, without waiting for assessments, without waiting to know how bad because we needed to change that outcome.”

“Fugate also emphasized that using social media as an information source isn’t a precise science and the response isn’t going to be precise either. “Disasters are like horseshoes, hand grenades and thermal nuclear devices, you just need to be close— preferably more than less.”

Big Data Philanthropy for Humanitarian Response

My colleague Robert Kirkpatrick from Global Pulse has been actively promoting the concept of “data philanthropy” within the context of development. Data philanthropy involves companies sharing proprietary datasets for social good. I believe we urgently need big (social) data philanthropy for humanitarian response as well. Disaster-affected communities are increasingly the source of big data, which they generate and share via social media platforms like twitter. Processing this data manually, however, is very time consuming and resource intensive. Indeed, large numbers of digital humanitarian volunteers are often needed to monitor and process user-generated content from disaster-affected communities in near real-time.

Meanwhile, companies like Crimson Hexagon, Geofeedia, NetBase, Netvibes, RecordedFuture and Social Flow are defining the cutting edge of automated methods for media monitoring and analysis. So why not set up a Big Data Philanthropy group for humanitarian response in partnership with the Digital Humanitarian Network? Call it Corporate Social Responsibility (CRS) for digital humanitarian response. These companies would benefit from the publicity of supporting such positive and highly visible efforts. They would also receive expert feedback on their tools.

This “Emergency Access Initiative” could be modeled along the lines of the International Charter whereby certain criteria vis-a-vis the disaster would need to be met before an activation request could be made to the Big Data Philanthropy group for humanitarian response. These companies would then provide a dedicated account to the Digital Humanitarian Network (DHNet). These accounts would be available for 72 hours only and also be monitored by said companies to ensure they aren’t being abused. We would simply need to  have relevant members of the DHNet trained on these platforms and draft the appropriate protocols, data privacy measures and MoUs.

I’ve had preliminary conversations with humanitarian colleagues from the United Nations and DHnet who confirm that “this type of collaboration would be see very positively from the coordination area within the traditional humanitarian sector.” On the business development end, this setup would enable companies to get their foot in the door of the humanitarian sector—a multi-billion dollar industry. Members of the DHNet are early adopters of humanitarian technology and are ideally placed to demonstrate the added value of these platforms since they regularly partner with large humanitarian organizations. Indeed, DHNet operates as a partnership model. This would enable humanitarian professionals to learn about new Big Data tools, see them in action and, possibly, purchase full licenses for their organizations. In sum, data philanthropy is good for business.

I have colleagues at most of the companies listed above and thus plan to actively pursue this idea further. In the meantime, I’d be very grateful for any feedback and suggestions, particularly on the suggested protocols and MoUs. So I’ve set up this open and editable Google Doc for feedback.

Big thanks to the team at the Disaster Information Management Research Center (DIMRC) for planting the seeds of this idea during our recent meeting. Check out their very neat Emergency Access Initiative.

The KoBo Platform: Data Collection for Real Practitioners

Update: be sure to check out the excellent points in the comments section below.

I recently visited my alma mater, the Harvard Humanitarian Initiative (HHI), where I learned more about the free and open source KoBo ToolBox project that my colleagues Phuong Pham, Patrick Vinck and John Etherton have been working on. What really attracts me about KoBo, which means transfer in Acholi, is that the entire initiative is driven by highly experienced and respec-ted practitioners. Often, software developers are the ones who build these types of platforms in the hopes that they add value to the work of practitioners. In the case of KoBo, a team of seasoned practitioners are fully in the drivers seat. The result is a highly dedicated, customized and relevant solution.

Phuong and Patrick first piloted handheld digital data collection in 2007 in Northern Uganda. This early experience informed the development of KoBo which continues to be driven by actual field-based needs and challenges such as limited technical know-how. In short, KoBo provides an integrated suite of applications for handheld data collection that are specifically designed for a non-technical audience, ie., the vast majority of human rights and humanitarian practitioners out there. This suite of applications enable users to collect and analyze field data in virtually real-time.

KoBoForm allows you to build multimedia surveys for data collection purposes, integrating special datatypes like bar-codes, images and audio. Time stamps and geo-location via GPS let you know exactly where and when the data was collected (important for monitoring and evaluation, for example). KoBoForm’s optional data constraints and skip logic further ensure data accuracy. KoBoCollect is an Android-based app based on ODK. Surveys built with KoBoForm are easily uploaded to any number of Android phones sporting the KoBoCollect app, which can also be used offline and automatically synched when back in range. KoBoSync pushes survey data from the Android(s) to your computer for data analysis while KoBoMap lets you display your results in an interactive map with a user-friendly interface. Importantly, KoBoMap is optimized for low-bandwidth connections.

The KoBo platform has been used in to conduct large scale population studies in places like the Central African Republic, Northern Uganda and Liberia. In total, Phuong and Patrick have interviewed more than 25,000 individuals in these countries using KoBo, so the tool has certainly been tried and tested. The resulting data, by the way, is available via this data-visualization portal. The team is  currently building new features for KoBo to apply the tool in the Democratic Republic of the Congo (DRC). They are also collaborating with UNDP to develop a judicial monitoring project in the DRC using KoBoToolbox, which will help them “think through some of the requirements for longitudinal data collection and tracking of cases.”

In sum, the expert team behind KoBo is building these software solutions first and foremost for their own field work. As Patrick notes here, “the use of these tools was instrumental to the success of many of our projects.” This makes all the difference vis-a-vis the resulting technology.

Crisis Mapping Syria: Automated Data Mining and Crowdsourced Human Intelligence

The Syria Tracker Crisis Map is without doubt one of the most impressive crisis mapping projects yet. Launched just a few weeks after the protests began one year ago, the crisis map is spearheaded by a just handful of US-based Syrian activists have meticulously and systematically documented 1,529 reports of human rights violations including a total of 11,147 killings. As recently reported in this NewScientist article, “Mapping the Human Cost of Syria’s Uprising,” the crisis map “could be the most accurate estimate yet of the death toll in Syria’s uprising […].” Their approach? “A combination of automated data mining and crowdsourced human intelligence,” which “could provide a powerful means to assess the human cost of wars and disasters.”

On the data-mining side, Syria Tracker has repurposed the HealthMap platform, which mines thousands of online sources for the purposes of disease detection and then maps the results, “giving public-health officials an easy way to monitor local disease conditions.” The customized version of this platform for Syria Tracker (ST), known as HealthMap Crisis, mines English information sources for evidence of human rights violations, such as killings, torture and detainment. As the ST Team notes, their data mining platform “draws from a broad range of sources to reduce reporting biases.” Between June 2011 and January 2012, for example, the platform collected over 43,o00 news articles and blog posts from almost 2,000 English-based sources from around the world (including some pro-regime sources).

Syria Tracker combines the results of this sophisticated data mining approach with crowdsourced human intelligence, i.e., field-based eye-witness reports shared via webform, email, Twitter, Facebook, YouTube and voicemail. This naturally presents several important security issues, which explains why the main ST website includes an instructions page detailing security precautions that need to be taken while sub-mitting reports from within Syria. They also link to this practical guide on how to protect your identity and security online and when using mobile phones. The guide is available in both English and Arabic.

Eye-witness reports are subsequently translated, geo-referenced, coded and verified by a group of volunteers who triangulate the information with other sources such as those provided by the HealthMap Crisis platform. They also filter the reports and remove dupli-cates. Reports that have a low con-fidence level vis-a-vis veracity are also removed. Volunteers use a dig-up or vote-up/vote-down feature to “score” the veracity of eye-witness reports. Using this approach, the ST Team and their volunteers have been able to verify almost 90% of the documented killings mapped on their platform thanks to video and/or photographic evidence. They have also been able to associate specific names to about 88% of those reported killed by Syrian forces since the uprising began.

Depending on the levels of violence in Syria, the turn-around time for a report to be mapped on Syria Tracker is between 1-3 days. The team also produces weekly situation reports based on the data they’ve collected along with detailed graphical analysis. KML files that can be uploaded and viewed using Google Earth are also made available on a regular basis. These provide “a more precisely geo-located tally of deaths per location.”

In sum, Syria Tracker is very much breaking new ground vis-a-vis crisis mapping. They’re combining automated data mining technology with crowdsourced eye-witness reports from Syria. In addition, they’ve been doing this for a year, which makes the project the longest running crisis maps I’ve seen in a hostile environ-ment. Moreover, they’ve been able to sustain these import efforts with just a small team of volunteers. As for the veracity of the collected information, I know of no other public effort that has taken such a meticulous and rigorous approach to documenting the killings in Syria in near real-time. On February 24th, Al-Jazeera posted the following estimates:

Syrian Revolution Coordination Union: 9,073 deaths
Local Coordination Committees: 8,551 deaths
Syrian Observatory for Human Rights: 5,581 deaths

At the time, Syria Tracker had a total of 7,901 documented killings associated with specific names, dates and locations. While some duplicate reports may remain, the team argues that “missing records are a much bigger source of error.” Indeed, They believe that “the higher estimates are more likely, even if one chooses to disregard those reports that came in on some of the most violent days where names were not always recorded.”

The Syria Crisis Map itself has been viewed by visitors from 136 countries around the world and 2,018 cities—with the top 3 cities being Damascus, Washington DC and, interestingly, Riyadh, Saudia Arabia. The witnessing has thus been truly global and collective. When the Syrian regime falls, “the data may help sub-sequent governments hold him and other senior leaders to account,” writes the New Scientist. This was one of the principle motivations behind the launch of the Ushahidi platform in Kenya over four years ago. Syria Tracker is powered by Ushahidi’s cloud-based platform, Crowdmap. Finally, we know for a fact that the International Criminal Court (ICC) and Amnesty International (AI) closely followed the Libya Crisis Map last year.

Crisis Mapping Climate Change, Conflict and Aid in Africa

I recently gave a guest lecture at the University of Texas, Austin, and finally had the opportunity to catch up with my colleague Josh Busby who has been working on a promising crisis mapping project as part of the university’s Climate Change and African Political Stability Program (CCAPS).

Josh and team just released the pilot version of its dynamic mapping tool, which aims to provide the most comprehensive view yet of climate change and security in Africa. The platform, developed in partnership with AidData, enables users to “visualize data on climate change vulnerability, conflict, and aid, and to analyze how these issues intersect in Africa.” The tool is powered by ESRI technology and allows researchers as well as policymakers to “select and layer any combination of CCAPS data onto one map to assess how myriad climate change impacts and responses intersect. For example, mapping conflict data over climate vulnera-bility data can assess how local conflict patterns could exacerbate climate-induced insecurity in a region. It also shows how conflict dynamics are changing over time and space.”

The platform provides hyper-local data on climate change and aid-funded interventions, which can provide important insights on how development assistance might (or might not) be reducing vulnerability. For example, aid projects funded by 27 donors in Malawi (i.e., aid flows) can be layered on top of the climate change vulnerability data to “discern whether adaptation aid is effectively targeting the regions where climate change poses the most significant risk to the sustainable development and political stability of a country.”

If this weren’t impressive enough, I was positively amazed when I learned from Josh and team that the conflict data they’re using, the Armed Conflict Location Event Data (ACLED), will be updated on a weekly basis as part of this project, which is absolutely stunning. Back in the day, ACLED was specifically coding historical data. A few years ago they closed the gap by updating some conflict data on a yearly basis. Now the temporal lag will just be one week. Note that the mapping tool already draws on the Social Conflict in Africa Database (SCAD).

This project is an important contribution to the field of crisis mapping and I look forward to following CCAPS’s progress closely over the next few months. I’m hoping that Josh will present this project at the 2012 International Crisis Mappers Conference (ICCM 2012) later this year.

Trails of Trustworthiness in Real-Time Streams

Real-time information channels like Twitter, Facebook and Google have created cascades of information that are becoming increasingly challenging to navigate. “Smart-filters” alone are not the solution since they won’t necessarily help us determine the quality and trustworthiness of the information we receive. I’ve been studying this challenge ever since the idea behind SwiftRiver first emerged several years ago now.

I was thus thrilled to come across a short paper on “Trails of Trustworthiness in Real-Time Streams” which describes a start-up project that aims to provide users with a “system that can maintain trails of trustworthiness propagated through real-time information channels,” which will “enable its educated users to evaluate its provenance, its credibility and the independence of the multiple sources that may provide this information.” The authors, Panagiotis Metaxas and Eni Mustafaraj, kindly cite my paper on “Information Forensics” and also reference SwiftRiver in their conclusion.

The paper argues that studying the tactics that propagandists employ in real life can provide insights and even predict the tricks employed by Web spammers.

“To prove the strength of this relationship between propagandistic and spamming techniques, […] we show that one can, in fact, use anti-propagandistic techniques to discover Web spamming networks. In particular, we demonstrate that when starting from an initial untrustworthy site, backwards propagation of distrust (looking at the graph defined by links pointing to to an untrustworthy site) is a successful approach to finding clusters of spamming, untrustworthy sites. This approach was inspired by the social behavior associated with distrust: in society, recognition of an untrustworthy entity (person, institution, idea, etc) is reason to question the trust- worthiness of those who recommend it. Other entities that are found to strongly support untrustworthy entities become less trustworthy themselves. As in society, distrust is also propagated backwards on the Web graph.”

The authors document that today’s Web spammers are using increasingly sophisticated tricks.

“In cases where there are high stakes, Web spammers’ influence may have important consequences for a whole country. For example, in the 2006 Congressional elections, activists using Google bombs orchestrated an effort to game search engines so that they present information in the search results that was unfavorable to 50 targeted candidates. While this was an operation conducted in the open, spammers prefer to work in secrecy so that their actions are not revealed. So,  revealed and documented the first Twitter bomb, which tried to influence the Massachusetts special elections, show- ing how an Iowa-based political group, hiding its affiliation and profile, was able to serve misinformation a day before the election to more than 60,000 Twitter users that were follow- ing the elections. Very recently we saw an increase in political cybersquatting, a phenomenon we reported in [28]. And even more recently, […] we discovered the existence of Pre-fabricated Twitter factories, an effort to provide collaborators pre-compiled tweets that will attack members of the Media while avoiding detection of automatic spam algorithms from Twitter.

The theoretical foundations for a trustworthiness system:

“Our concept of trustworthiness comes from the epistemology of knowledge. When we believe that some piece of information is trustworthy (e.g., true, or mostly true), we do so for intrinsic and/or extrinsic reasons. Intrinsic reasons are those that we acknowledge because they agree with our own prior experience or belief. Extrinsic reasons are those that we accept because we trust the conveyor of the information. If we have limited information about the conveyor of information, we look for a combination of independent sources that may support the information we receive (e.g., we employ “triangulation” of the information paths). In the design of our system we aim to automatize as much as possible the process of determining the reasons that support the information we receive.”

“We define as trustworthy, information that is deemed reliable enough (i.e., with some probability) to justify action by the receiver in the future. In other words, trustworthiness is observable through actions.”

“The overall trustworthiness of the information we receive is determined by a linear combination of (a) the reputation RZ of the original sender Z, (b) the credibility we associate with the contents of the message itself C(m), and (c) characteristics of the path that the message used to reach us.”

“To compute the trustworthiness of each message from scratch is clearly a huge task. But the research that has been done so far justifies optimism in creating a semi-automatic, personalized tool that will help its users make sense of the information they receive. Clearly, no such system exists right now, but components of our system do exist in some of the popular [real-time information channels]. For a testing and evaluation of our system we plan to use primarily Twitter, but also real-time Google results and Facebook.”

In order to provide trails of trustworthiness in real-time streams, the authors plan to address the following challenges:

•  “Establishment of new metrics that will help evaluate the trustworthiness of information people receive, especially from real-time sources, which may demand immediate attention and action. […] we show that coverage of a wider range of opinions, along with independence of results’ provenance, can enhance the quality of organic search results. We plan to extend this work in the area of real-time information so that it does not rely on post-processing procedures that evaluate quality, but on real-time algorithms that maintain a trail of trustworthiness for every piece of information the user receives.”

• “Monitor the evolving ways in which information reaches users, in particular citizens near election time.”

•  “Establish a personalizable model that captures the parameters involved in the determination of trustworthiness of in- formation in real-time information channels, such as Twitter, extending the work of measuring quality in more static information channels, and by applying machine learning and data mining algorithms. To implement this task, we will design online algorithms that support the determination of quality via the maintenance of trails of trustworthiness that each piece of information carries with it, either explicitly or implicitly. Of particular importance, is that these algorithms should help maintain privacy for the user’s trusting network.”

• “Design algorithms that can detect attacks on [real-time information channels]. For example we can automatically detect bursts of activity re- lated to a subject, source, or non-independent sources. We have already made progress in this area. Recently, we advised and provided data to a group of researchers at Indiana University to help them implement “truthy”, a site that monitors bursty activity on Twitter.  We plan to advance, fine-tune and automate this process. In particular, we will develop algorithms that calculate the trust in an information trail based on a score that is affected by the influence and trustworthiness of the informants.”

In conclusion, the authors “mention that in a month from this writing, Ushahidi […] plans to release SwiftRiver, a platform that ‘enables the filtering and verification of real-time data from channels like Twitter, SMS, Email and RSS feeds’. Several of the features of Swift River seem similar to what we propose, though a major difference appears to be that our design is personalization at the individual user level.”

Indeed, having been involved in SwiftRiver research since early 2009 and currently testing the private beta, there are important similarities and some differences. But one such difference is not personalization. Indeed, Swift allows full personalization at the individual user level.

Another is that we’re hoping to go beyond just text-based information with Swift, i.e., we hope to pull in pictures and video footage (in addition to Tweets, RSS feeds, email, SMS, etc) in order to cross-validate information across media, which we expect will make the falsification of crowdsourced information more challenging, as I argue here. In any case, I very much hope that the system being developed by the authors will be free and open source so that integration might be possible.

A copy of the paper is available here (PDF). I hope to meet the authors at the Berkman Center’s “Truth in Digital Media Symposium” and highly recommend the wiki they’ve put together with additional resources. I’ve added the majority of my research on verification of crowdsourced information to that wiki, such as my 20-page study on “Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media.”