Category Archives: Social Computing

Big Data for Development: Challenges and Opportunities

The UN Global Pulse report on Big Data for Development ought to be required reading for anyone interested in humanitarian applications of Big Data. The purpose of this post is not to summarize this excellent 50-page document but to relay the most important insights contained therein. In addition, I question the motivation behind the unbalanced commentary on Haiti, which is my only major criticism of this otherwise authoritative report.

Real-time “does not always mean occurring immediately. Rather, “real-time” can be understood as information which is produced and made available in a relatively short and relevant period of time, and information which is made available within a timeframe that allows action to be taken in response i.e. creating a feedback loop. Importantly, it is the intrinsic time dimensionality of the data, and that of the feedback loop that jointly define its characteristic as real-time. (One could also add that the real-time nature of the data is ultimately contingent on the analysis being conducted in real-time, and by extension, where action is required, used in real-time).”

Data privacy “is the most sensitive issue, with conceptual, legal, and technological implications.” To be sure, “because privacy is a pillar of democracy, we must remain alert to the possibility that it might be compromised by the rise of new technologies, and put in place all necessary safeguards.” Privacy is defined by the International Telecommunication Union as the “right of individuals to control or influence what information related to them may be disclosed.” Moving forward, “these concerns must nurture and shape on-going debates around data privacy in the digital age in a constructive manner in order to devise strong principles and strict rules, backed by adequate tools and systems, to ensure ‘privacy-preserving analysis.’”

Non-representative data is often dismissed outright since findings based on such data cannot be generalized beyond that sample. “But while findings based on non-representative datasets need to be treated with caution, they are not valueless […].” Indeed, while the “sampling selection bias can clearly be a challenge, especially in regions or communities where technological penetration is low […],  this does not mean that the data has no value. For one, data from “non-representative” samples (such as mobile phone users) provide representative information about the sample itself—and do so in close to real time and on a potentially large and growing scale, such that the challenge will become less and less salient as technology spreads across and within developing countries.”

Perceptions rather than reality is what social media captures. Moreover, these perceptions can also be wrong. But only those individuals “who wrongfully assume that the data is an accurate picture of reality can be deceived. Furthermore, there are instances where wrong perceptions are precisely what is desirable to monitor because they might determine collective behaviors in ways that can have catastrophic effects.” In other words, “perceptions can also shape reality. Detecting and understanding perceptions quickly can help change outcomes.”

False data and hoaxes are part and parcel of user-generated content. While the challenges around reliability and verifiability are real, some media organizations, such as the BBC, stand by the utility of citizen reporting of current events: “there are many brave people out there, and some of them are prolific bloggers and Tweeters. We should not ignore the real ones because we were fooled by a fake one.” They have thus devised internal strategies to confirm the veracity of the information they receive and choose to report, “offering an example of what can be done to mitigate the challenge of false information.” See for example my 20-page study on how to verify crowdsourced social media data, a field I refer to as information forensics. In any event, “whether false negatives are more or less problematic than false positives depends on what is being monitored, and why it is being monitored.”

“The United States Geological Survey (USGS) has developed a system that monitors Twitter for significant spikes in the volume of messages about earthquakes,” and 90% of the user-generated reports that trigger an alert have turned out to be valid. “Similarly, a recent retrospective analysis of the 2010 cholera outbreak in Haiti conducted by researchers at Harvard Medical School and Children’s Hospital Boston demonstrated that mining Twitter and online news reports could have provided health officials a highly accurate indication of the actual spread of the disease with two weeks lead time.”
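To make the idea of spike monitoring concrete, here is a minimal sketch of one way to flag a sudden jump in message volume. It is not the USGS system, whose internals the report does not describe; the function name, the trailing-window baseline, the threshold and the hourly counts below are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=6, threshold=3.0):
    """Return the indices whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations.

    `counts` is a list of message counts per time bucket (e.g. per hour).
    All parameters are illustrative assumptions, not USGS settings.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline does not
        # turn every tiny fluctuation into an alert.
        sigma = max(stdev(baseline), 1.0)
        if counts[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical hourly counts of tweets mentioning "earthquake".
hourly_counts = [4, 5, 3, 6, 4, 5, 4, 48, 52, 6]
spike_hours = detect_spikes(hourly_counts)
print(spike_hours)
```

Note that a simple trailing baseline like this one is inflated by the first hour of a burst, so only the onset of a sustained spike is flagged. For an alerting use case that is arguably the moment that matters most.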

This leads to the other Haiti example raised in the report, namely the finding that SMS data was correlated with building damage. Please see my previous blog posts here and here for context. What the authors seem to overlook is that Benetech apparently did not submit their counter-findings for independent peer review, whereas the team at the European Commission’s Joint Research Centre (JRC) did, and the latter passed the peer-review process. Peer review is how rigorous scientific work is validated. The fact that Benetech never submitted their blog post for peer review is actually quite telling.

In sum, while this Big Data report is otherwise strong and balanced, I am really surprised that they cite a blog post as “evidence” while completely ignoring the JRC’s peer-reviewed scientific paper published in the Journal of the European Geosciences Union. Until counter-findings are submitted for peer review, the JRC’s results stand: unverified, non-representative crowdsourced text messages from the disaster-affected population in Port-au-Prince, which were translated from Haitian Creole to English via a novel crowdsourced volunteer effort and subsequently geo-referenced by hundreds of volunteers without any quality control, produced a statistically significant, positive correlation with building damage.

In conclusion, “any challenge with utilizing Big Data sources of information cannot be assessed divorced from the intended use of the information. These new, digital data sources may not be the best suited to conduct airtight scientific analysis, but they have a huge potential for a whole range of other applications that can greatly affect development outcomes.”

One such application is disaster response. Earlier this year, FEMA Administrator Craig Fugate gave a superb presentation on “Real Time Awareness” in which he relayed an example of how he and his team used Big Data (Twitter) during a series of devastating tornadoes in 2011:

“Mr. Fugate proposed dispatching relief supplies to the long list of locations immediately and received pushback from his team who were concerned that they did not yet have an accurate estimate of the level of damage. His challenge was to get the staff to understand that the priority should be one of changing outcomes, and thus even if half of the supplies dispatched were never used and sent back later, there would be no chance of reaching communities in need if they were in fact suffering tornado damage already, without getting trucks out immediately. He explained, “if you’re waiting to react to the aftermath of an event until you have a formal assessment, you’re going to lose 12-to-24 hours…Perhaps we shouldn’t be waiting for that. Perhaps we should make the assumption that if something bad happens, it’s bad. Speed in response is the most perishable commodity you have…We looked at social media as the public telling us enough information to suggest this was worse than we thought and to make decisions to spend [taxpayer] money to get moving without waiting for formal request, without waiting for assessments, without waiting to know how bad because we needed to change that outcome.”

“Fugate also emphasized that using social media as an information source isn’t a precise science and the response isn’t going to be precise either. “Disasters are like horseshoes, hand grenades and thermal nuclear devices, you just need to be close— preferably more than less.”

Big Data Philanthropy for Humanitarian Response

My colleague Robert Kirkpatrick from Global Pulse has been actively promoting the concept of “data philanthropy” within the context of development. Data philanthropy involves companies sharing proprietary datasets for social good. I believe we urgently need big (social) data philanthropy for humanitarian response as well. Disaster-affected communities are increasingly the source of big data, which they generate and share via social media platforms like Twitter. Processing this data manually, however, is very time-consuming and resource-intensive. Indeed, large numbers of digital humanitarian volunteers are often needed to monitor and process user-generated content from disaster-affected communities in near real-time.

Meanwhile, companies like Crimson Hexagon, Geofeedia, NetBase, Netvibes, RecordedFuture and Social Flow are defining the cutting edge of automated methods for media monitoring and analysis. So why not set up a Big Data Philanthropy group for humanitarian response in partnership with the Digital Humanitarian Network? Call it Corporate Social Responsibility (CSR) for digital humanitarian response. These companies would benefit from the publicity of supporting such positive and highly visible efforts. They would also receive expert feedback on their tools.

This “Emergency Access Initiative” could be modeled along the lines of the International Charter whereby certain criteria vis-a-vis the disaster would need to be met before an activation request could be made to the Big Data Philanthropy group for humanitarian response. These companies would then provide a dedicated account to the Digital Humanitarian Network (DHNet). These accounts would be available for 72 hours only and also be monitored by said companies to ensure they aren’t being abused. We would simply need to have relevant members of the DHNet trained on these platforms and draft the appropriate protocols, data privacy measures and MoUs.

I’ve had preliminary conversations with humanitarian colleagues from the United Nations and DHNet who confirm that “this type of collaboration would be seen very positively from the coordination area within the traditional humanitarian sector.” On the business development end, this setup would enable companies to get their foot in the door of the humanitarian sector, a multi-billion dollar industry. Members of the DHNet are early adopters of humanitarian technology and are ideally placed to demonstrate the added value of these platforms since they regularly partner with large humanitarian organizations. Indeed, DHNet operates as a partnership model. This would enable humanitarian professionals to learn about new Big Data tools, see them in action and, possibly, purchase full licenses for their organizations. In sum, data philanthropy is good for business.

I have colleagues at most of the companies listed above and thus plan to actively pursue this idea further. In the meantime, I’d be very grateful for any feedback and suggestions, particularly on the suggested protocols and MoUs. So I’ve set up this open and editable Google Doc for feedback.

Big thanks to the team at the Disaster Information Management Research Center (DIMRC) for planting the seeds of this idea during our recent meeting. Check out their very neat Emergency Access Initiative.

Geofeedia: Next Generation Crisis Mapping Technology?

My colleague Jeannine Lemaire from the Core Team of the Standby Volunteer Task Force (SBTF) recently pointed me to Geofeedia, which may very well be the next generation in crisis mapping technology. So I spent over an hour talking with Geofeedia’s CEO, Phil Harris, to learn more about the platform and discuss potential applications for humanitarian response. The short version: I’m impressed, not just with the technology itself and its potential, but also with Phil’s deep intuition and genuine interest in building a platform that enables others to scale positive social impact.

Situational awareness is absolutely key to emergency response, hence the rise of crisis mapping. The challenge? Processing and geo-referencing Big Data from social media sources to produce live maps has largely been a manual (and arduous) task for many in the humanitarian space. In fact, a number of humanitarian colleagues I’ve spoken to recently have complained that the manual labor required to create (and maintain) live maps is precisely why they aren’t able to launch their own crisis maps. I know this is also true of several international media organizations.

There have been several attempts at creating automated live maps. Take Havaria and Global Incidents Map, for example. But neither of these provides the customizability necessary for users to apply the platforms in meaningful ways. Enter Geofeedia. Let’s take the recent earthquake and 800 aftershocks in Emilia, Italy. Simply type in the place name (or an exact address) and hit enter. Geofeedia automatically parses Twitter, YouTube, Flickr, Picasa and Instagram for the latest updates in that area and populates the map with this content. The algorithm pulls in data that is already geo-tagged and designated as public.

The geo-tagging happens on the smartphone, laptop or desktop when an image or Tweet is generated. The platform then allows you to pivot between the map and a collage of the automatically harvested content. Note that each entry includes a time stamp. Of course, since the search function is purely geo-based, the result will not be restricted to earthquake-related updates, hence the picture of friends at a picnic.

But let’s click on the picture of the collapsed roof directly to the left. This opens up a new page with the following: the original picture and a map displaying where this picture was taken.

In between these, you’ll note the source of the picture, the time it was uploaded and the author. Directly below this you’ll find the option to query the map further by geographic distance. Let’s click on the 300 meters option. The result is the updated collage below.
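For readers curious about what a query by geographic distance involves, here is a rough sketch using the haversine great-circle formula. This is my own illustration and not Geofeedia’s implementation; the function names, the item fields and the coordinates (made up, loosely in the Emilia region) are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_radius(items, center, radius_m):
    """Keep only the geo-tagged items within `radius_m` of `center`."""
    lat0, lon0 = center
    return [
        item for item in items
        if haversine_m(lat0, lon0, item["lat"], item["lon"]) <= radius_m
    ]


# Hypothetical geo-tagged items; ids and coordinates are invented.
items = [
    {"id": "photo-1", "lat": 44.8900, "lon": 11.0700},
    {"id": "tweet-2", "lat": 44.8915, "lon": 11.0710},
    {"id": "video-3", "lat": 44.9500, "lon": 11.2000},
]
nearby = within_radius(items, center=(44.8900, 11.0700), radius_m=300)
print([item["id"] for item in nearby])
```

The third item is several kilometers away and is dropped, while the first two fall inside the 300-meter circle around the selected picture.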

We now see a lot more content relevant to the earthquake than we did after the initial search. Geofeedia only parses for recently published information, which adds temporal relevance to the geographic search. The result of combining these two dimensions is a more filtered result. Incidentally, Geofeedia allows you to save and very easily share these searches and results. Now let’s click on the first picture on the top left.

Geofeedia allows you to create collections (top right-hand corner). I’ve called mine “Earthquake Damage” so I can collect all the relevant Tweets, pictures and video footage of the disaster. The platform gives me the option of inviting specific colleagues to view and help curate this new collection by adding other relevant content such as tweets and video footage. Together with Geofeedia’s multi-media approach, these features facilitate the clustering and triangulation of multi-media data in a very easy way.

Now let’s pivot from these search results in collage form to the search results in map view. This display can also be saved and shared with others.

One of the clear strengths of Geofeedia is the simplicity of the user-interface. Key features and functions are esthetically designed. For example, if we wish to view the YouTube footage that is closest to the circle’s center, simply click on the icon and the video can be watched in the pop-up on the same page.

Now notice the menu just to the right of the YouTube video. Geofeedia allows you to create geo-fences on the fly. For example, we can click on “Search by Polygon” and draw a “digital fence” of that shape directly onto the map with just a few clicks of the mouse. Say we’re interested in the residential area just north of Via Statale. Simply trace the area, double-click to finish and then press on the magnifying glass icon to search for the latest social media updates and Geofeedia will return all content with relevant geo-tags.
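Conceptually, a polygon geo-fence boils down to a point-in-polygon test on each item’s geo-tag. Below is a minimal sketch using the classic ray-casting method. I don’t know how Geofeedia implements this; the function name and the fence coordinates are hypothetical, and the sketch treats latitude and longitude as flat coordinates, which is only reasonable for small areas like a neighborhood.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test: is (lat, lon) inside the polygon?

    `polygon` is a list of (lat, lon) vertices in drawing order.
    Coordinates are treated as planar, an assumption that only
    holds for small fences.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


# Hypothetical fence traced around a residential area.
fence = [(44.88, 11.06), (44.88, 11.08), (44.90, 11.08), (44.90, 11.06)]
print(point_in_polygon(44.89, 11.07, fence))  # a point inside the fence
print(point_in_polygon(44.91, 11.07, fence))  # a point north of the fence
```

Every crossing of a fence edge flips the inside/outside state, so a point with an odd number of crossings to one side of it lies within the fence.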

The platform allows us to filter these results further via the “Settings” menu as displayed below. On the technical side, the tool’s API supports ATOM/RSS, JSON and GeoRSS formats.
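As an aside for the more technical readers, here is a sketch of how one might read geo-tagged entries out of a generic Atom feed that uses GeoRSS-Simple points. The sample feed is entirely made up and is not actual Geofeedia output (I haven’t seen the API’s response format); only the two XML namespaces are standard.

```python
import xml.etree.ElementTree as ET

# Standard Atom and GeoRSS-Simple namespaces.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Hypothetical feed, invented for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Hypothetical geo-tagged updates</title>
  <entry>
    <title>Collapsed roof</title>
    <georss:point>44.8900 11.0700</georss:point>
  </entry>
  <entry>
    <title>Street damage</title>
    <georss:point>44.8915 11.0710</georss:point>
  </entry>
  <entry>
    <title>Entry without a geo-tag</title>
  </entry>
</feed>"""


def parse_points(feed_xml):
    """Return (title, lat, lon) for every entry with a georss:point."""
    root = ET.fromstring(feed_xml)
    points = []
    for entry in root.findall("atom:entry", NS):
        point = entry.find("georss:point", NS)
        if point is None or not point.text:
            continue  # skip entries that carry no location
        # GeoRSS-Simple encodes a point as "latitude longitude".
        lat, lon = (float(value) for value in point.text.split())
        title = entry.findtext("atom:title", default="", namespaces=NS)
        points.append((title, lat, lon))
    return points


points = parse_points(SAMPLE_FEED)
print(points)
```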

Geofeedia has a lot of potential vis-a-vis humanitarian applications, which is why the Standby Volunteer Task Force (SBTF) is partnering with the group to explore this potential further. A forthcoming blog post on the SBTF blog will outline this partnership in more detail.

In the meantime, below are a few thoughts and suggestions for Phil and team on how they can make Geofeedia even more relevant and compelling for humanitarian applications. A quick qualifier is in order beforehand, however. I often have a tendency to ask for the moon when discovering a new platform I’m excited about. The suggestions that follow are thus not criticism at all but rather the result of my imagination gone wild. So big congrats to Phil and team for having built what is already a very, very neat platform!

  • Topical search feature that enables users to search by location and a specific theme or topic.
  • Delete function that allows users to delete content that is not relevant to them either from the Map or Collage interface. In the future, perhaps some “basic” machine learning algorithms could be added to learn what types of content the user does not want displayed or prioritized.
  • Add function that gives users the option of adding relevant multi-media content, say perhaps from a blog post, a Wikipedia entry, news article or (Geo)RSS feed. I would be particularly interested in seeing a Storyful feed integrated into Geofeedia, for example. The ability to add KML files could also be interesting, e.g., a KML of an earthquake’s epicenter and estimated impact.
  • Commenting function that enables users to comment on individual data points (Tweets, pictures, etc) and a “discussion forum” feature that enables users to engage in text-based conversation vis-a-vis a specific data point.
  • Storify feature that gives users the ability to turn their curated content into a storify-like story board with narrative. A Storify plugin perhaps.
  • Ushahidi feature that enables users to export an item (Tweet, picture, etc) directly to an Ushahidi platform with just one click. This feature should also allow for the automatic publishing of said item on an Ushahidi map.
  • Alerts function that allows one to turn a geo-fence into an automated alert feature. For example, once I’ve created my geo-fence, having an option that allows me (and others) to subscribe to this geo-fence for future updates could be particularly interesting. These alerts would be sent out as emails (and maybe SMS) with a link to the new picture or Tweet that has been geo-tagged within the geographical area of the geo-fence. Perhaps each geo-fence could tweet updates directly to anyone subscribed to that Geofeedia deployment.
  • Trends alert feature that gives users the option of subscribing to specific trends of interest. For example, I’d like to be notified if the number of data points in my geo-fence increases by more than 25% within a 24-hour time period. Or more specifically whether the number of pictures has suddenly increased. These meta-level trends can provide important insights vis-a-vis early detection & response.
  • Analytics function that produces summary statistics and trends analysis for a geo-fence of interest. This is where Geofeedia could better capture temporal dynamics by including charts, graphs and simple time-series analysis to depict how events have been unfolding over the past hour vs 12 hours, 24 hours, etc.
  • Sentiment analysis feature that enables users to have an at-a-glance understanding of the sentiments and moods being expressed in the harvested social media content.
  • Augmented Reality feature … just kidding (sort-of).
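To illustrate the trends alert idea from the list above, here is a small sketch that compares the number of data points in the latest 24 hours with the preceding 24 hours and fires when the increase exceeds 25%. The function names and timestamps are my own inventions, and a real feature would of course need to handle time zones and noisy low-volume fences.

```python
from datetime import datetime, timedelta


def percent_change(timestamps, now, window=timedelta(hours=24)):
    """Percent change in item count between the latest window and the
    window just before it. Returns None when there is no baseline."""
    recent = sum(1 for t in timestamps if now - window < t <= now)
    previous = sum(
        1 for t in timestamps if now - 2 * window < t <= now - window
    )
    if previous == 0:
        return None
    return 100.0 * (recent - previous) / previous


def should_alert(timestamps, now, threshold_pct=25.0):
    """True when volume grew by more than `threshold_pct` percent."""
    change = percent_change(timestamps, now)
    return change is not None and change > threshold_pct


# Hypothetical timestamps of items geo-tagged inside one geo-fence:
# six in the last 24 hours, four in the 24 hours before that.
now = datetime(2012, 6, 1, 12, 0)
hours_ago = (1, 3, 5, 8, 13, 21, 26, 30, 37, 45)
timestamps = [now - timedelta(hours=h) for h in hours_ago]

change = percent_change(timestamps, now)
alert = should_alert(timestamps, now)
print(change, alert)
```

Going from four items to six is a 50% increase, so this geo-fence would trigger an alert under the 25% rule.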

Naturally, most or all of the above may not be in line with Geofeedia’s vision, purpose or business model. But I very much look forward to collaborating with Phil & team vis-a-vis our SBTF partnership. A big thanks to Jeannine once again for pointing me to Geofeedia, and equally big thanks to my SBTF colleague Timo Luege for his blog post on the platform. I’m thrilled to see more colleagues actively blog about the application of new technologies for disaster response.

On this note, anyone familiar with this new Iremos platform (above picture) from France? They recently contacted me to offer a demo.

Using Rayesna to Track the 2012 Egyptian Presidential Candidates on Twitter

My (future) colleagues at the Qatar Foundation’s Computing Research Institute (QCRI) have just launched a new platform that Al Jazeera is using to track the 2012 Egyptian Presidential Candidates on Twitter. Called Rayesna, which means “our president” in colloquial Egyptian Arabic, this fully automated platform uses cutting-edge Arabic computational linguistics processing developed by the Arabic Language Technology (ALT) group at QCRI.

“Through Rayesna, you can find out how many times a candidate is mentioned, which other candidate he is likely to appear with, and the most popular tweets for a candidate, with a special category for the most retweeted jokes about the candidates. The site also has a time-series to explore and compares the mentions of the candidate day-by-day. Caveats: 1. The site reflects only the people who choose to tweet, and this group may not be representative of general society; 2. Tweets often contain foul language and we do not perform any filtering.”

I look forward to collaborating with the ALT group and exploring how their platform might also be used in the context of humanitarian response in the Arab World and beyond. There may also be important synergies with the work of the UN Global Pulse, particularly vis-a-vis their use of Twitter for real-time analysis of vulnerable communities.

Behind the Scenes: The Digital Operations Center of the American Red Cross

The Digital Operations Center at the American Red Cross is an important and exciting development. I recently sat down with Wendy Harman to learn more about the initiative and to exchange some lessons learned in this new world of digital humanitarians. One common challenge in emergency response is scaling. The American Red Cross cannot be everywhere at the same time, and that includes being on social media. More than 4,000 tweets reference the Red Cross on an average day, a figure that skyrockets during disasters. And when crises strike, so does Big Data. The Digital Operations Center is one response to this scaling challenge.

Sponsored by Dell, the Center uses customized software produced by Radian 6 to monitor and analyze social media in real-time. The Center itself seats three people who have access to six customized screens that relay relevant information drawn from various social media channels. The first screen below depicts some of the key topical areas that the Red Cross monitors, e.g., references to the American Red Cross, Storms in 2012, and Delivery Services.

Circle sizes in the first screen depict the volume of references related to that topic area. The color coding (red, green and beige) relates to sentiment analysis (beige being neutral). The dashboard with the “speed dials” right underneath the first screen provides more details on the sentiment analysis.

Let’s take a closer look at the circles from the first screen. The dots “orbiting” the central icon relate to the categories of key words that the Radian 6 platform parses. You can click on these orbiting dots to “drill down” and view the individual key words that make up that specific category. This circles screen gets updated in near real-time and draws on data from Twitter, Facebook, YouTube, Flickr and blogs. (Note that the distance between the orbiting dots and the center does not represent anything).

An operations center would of course not be complete without a map, so the Red Cross uses two screens to visualize different data on two heat maps. The one below depicts references made on social media platforms vis-a-vis storms that have occurred during the past 3 days.

The screen below the map highlights the bios of 50 individual Twitter users who have made references to the storms. All this data gets generated from the “Engagement Console” pictured below. The purpose of this web-based tool, which looks a lot like Tweetdeck, is to enable the Red Cross to customize the specific types of information they’re looking for, and to respond accordingly.

Let’s look at the Console more closely. In the Workflow section on the left, users decide what types of tags they’re looking for and can also filter by priority level. They can also specify the type of sentiment they’re looking for, e.g., negative feelings vis-a-vis a particular issue. In addition, they can take certain actions in response to each information item. For example, they can reply to a tweet, a Facebook status update, or a blog post; and they can do this directly from the Engagement Console. Based on the license that the Red Cross uses, up to 25 of their team members can access the Console and collaborate in real-time when processing the various tweets and Facebook updates.

The Console also allows users to create customized timelines, charts and Wordle graphics to better understand trends changing over time in the social media space. To fully leverage this social media monitoring platform, Wendy and team are also launching a digital volunteers program. The goal is for these volunteers to eventually become the prime users of the Radian platform and to filter the bulk of relevant information in the social media space. This would considerably lighten the load for existing staff. In other words, the volunteer program would help the American Red Cross scale in the social media world we live in.

Wendy plans to set up a dedicated 2-hour training for individuals who want to volunteer online in support of the Digital Operations Center. These trainings will be carried out via Webex and will also be available to existing Red Cross staff.


As argued in this previous blog post, the launch of this Digital Operations Center is further evidence that the humanitarian space is ready for innovation and that some technology companies are starting to think about how their solutions might be applied for humanitarian purposes. Indeed, it was Dell that first approached the Red Cross with an expressed interest in contributing to the organization’s efforts in disaster response. The initiative also demonstrates that combining automated natural language processing solutions with a digital volunteer network seems to be a winning strategy, at least for now.

After listening to Wendy describe the various tools she and her colleagues use as part of the Operations Center, I began to wonder whether these types of tools will eventually become free and easy enough for one person to be her very own operations center. I suppose only time will tell. Until then, I look forward to following the Center’s progress and hope it inspires other emergency response organizations to adopt similar solutions.

Crisis Mapping Syria: Automated Data Mining and Crowdsourced Human Intelligence

The Syria Tracker Crisis Map is without doubt one of the most impressive crisis mapping projects yet. Launched just a few weeks after the protests began one year ago, the crisis map is spearheaded by just a handful of US-based Syrian activists who have meticulously and systematically documented 1,529 reports of human rights violations including a total of 11,147 killings. As recently reported in this New Scientist article, “Mapping the Human Cost of Syria’s Uprising,” the crisis map “could be the most accurate estimate yet of the death toll in Syria’s uprising […].” Their approach? “A combination of automated data mining and crowdsourced human intelligence,” which “could provide a powerful means to assess the human cost of wars and disasters.”

On the data-mining side, Syria Tracker has repurposed the HealthMap platform, which mines thousands of online sources for the purposes of disease detection and then maps the results, “giving public-health officials an easy way to monitor local disease conditions.” The customized version of this platform for Syria Tracker (ST), known as HealthMap Crisis, mines English information sources for evidence of human rights violations, such as killings, torture and detainment. As the ST Team notes, their data mining platform “draws from a broad range of sources to reduce reporting biases.” Between June 2011 and January 2012, for example, the platform collected over 43,000 news articles and blog posts from almost 2,000 English-based sources from around the world (including some pro-regime sources).

Syria Tracker combines the results of this sophisticated data mining approach with crowdsourced human intelligence, i.e., field-based eye-witness reports shared via webform, email, Twitter, Facebook, YouTube and voicemail. This naturally presents several important security issues, which explains why the main ST website includes an instructions page detailing security precautions that need to be taken while submitting reports from within Syria. They also link to this practical guide on how to protect your identity and security online and when using mobile phones. The guide is available in both English and Arabic.

Eye-witness reports are subsequently translated, geo-referenced, coded and verified by a group of volunteers who triangulate the information with other sources such as those provided by the HealthMap Crisis platform. They also filter the reports and remove duplicates. Reports that have a low confidence level vis-a-vis veracity are also removed. Volunteers use a dig-up or vote-up/vote-down feature to “score” the veracity of eye-witness reports. Using this approach, the ST Team and their volunteers have been able to verify almost 90% of the documented killings mapped on their platform thanks to video and/or photographic evidence. They have also been able to associate specific names to about 88% of those reported killed by Syrian forces since the uprising began.
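The filtering steps described above (removing duplicates, scoring veracity by votes, dropping low-confidence reports) can be sketched in a few lines. To be clear, this is purely my own illustration and not Syria Tracker’s actual workflow or code: the field names, the duplicate key, the 0.6 threshold and the placeholder reports are all assumptions.

```python
def veracity_score(up_votes, down_votes):
    """Share of volunteer votes that were positive (0.0 if unvoted)."""
    total = up_votes + down_votes
    return up_votes / total if total else 0.0


def filter_reports(reports, min_score=0.6):
    """Drop duplicate reports, then drop those scoring below `min_score`.

    Two reports count as duplicates here when date, location and
    normalized text all match, which is a deliberately naive rule.
    """
    seen = set()
    kept = []
    for report in reports:
        key = (
            report["date"],
            report["location"],
            report["text"].strip().lower(),
        )
        if key in seen:
            continue  # duplicate of an earlier report
        seen.add(key)
        if veracity_score(report["up"], report["down"]) >= min_score:
            kept.append(report)
    return kept


# Placeholder reports with invented votes; no real data is used.
reports = [
    {"id": 1, "date": "2012-02-20", "location": "Town A",
     "text": "Report A", "up": 9, "down": 1},
    {"id": 2, "date": "2012-02-20", "location": "Town A",
     "text": "report a ", "up": 2, "down": 0},   # duplicate of 1
    {"id": 3, "date": "2012-02-21", "location": "Town B",
     "text": "Report B", "up": 1, "down": 4},    # low confidence
    {"id": 4, "date": "2012-02-22", "location": "Town C",
     "text": "Report C", "up": 0, "down": 0},    # not yet voted on
]
kept_ids = [report["id"] for report in filter_reports(reports)]
print(kept_ids)
```

In practice the triangulation with other sources and with video or photographic evidence is done by human volunteers, which no simple score like this can replace.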

Depending on the levels of violence in Syria, the turn-around time for a report to be mapped on Syria Tracker is between 1 and 3 days. The team also produces weekly situation reports based on the data they’ve collected along with detailed graphical analysis. KML files that can be uploaded and viewed using Google Earth are also made available on a regular basis. These provide “a more precisely geo-located tally of deaths per location.”

In sum, Syria Tracker is very much breaking new ground vis-a-vis crisis mapping. They’re combining automated data mining technology with crowdsourced eye-witness reports from Syria. In addition, they’ve been doing this for a year, which makes the project the longest-running crisis map I’ve seen in a hostile environment. Moreover, they’ve been able to sustain these important efforts with just a small team of volunteers. As for the veracity of the collected information, I know of no other public effort that has taken such a meticulous and rigorous approach to documenting the killings in Syria in near real-time. On February 24th, Al-Jazeera posted the following estimates:

Syrian Revolution Coordination Union: 9,073 deaths
Local Coordination Committees: 8,551 deaths
Syrian Observatory for Human Rights: 5,581 deaths

At the time, Syria Tracker had a total of 7,901 documented killings associated with specific names, dates and locations. While some duplicate reports may remain, the team argues that “missing records are a much bigger source of error.” Indeed, they believe that “the higher estimates are more likely, even if one chooses to disregard those reports that came in on some of the most violent days where names were not always recorded.”

The Syria Crisis Map itself has been viewed by visitors from 136 countries around the world and 2,018 cities—with the top 3 cities being Damascus, Washington DC and, interestingly, Riyadh, Saudi Arabia. The witnessing has thus been truly global and collective. When the Syrian regime falls, “the data may help subsequent governments hold him and other senior leaders to account,” writes the New Scientist. This was one of the principal motivations behind the launch of the Ushahidi platform in Kenya over four years ago. Syria Tracker is powered by Ushahidi’s cloud-based platform, Crowdmap. Finally, we know for a fact that the International Criminal Court (ICC) and Amnesty International (AI) closely followed the Libya Crisis Map last year.

Twitter, Crises and Early Detection: Why “Small Data” Still Matters

My colleagues John Brownstein and Rumi Chunara at Harvard University’s HealthMap project are continuing to break new ground in the field of Digital Disease Detection. Using data obtained from tweets and online news, the team was able to identify a cholera outbreak in Haiti weeks before health officials acknowledged the problem publicly. Meanwhile, my colleagues from UN Global Pulse partnered with Crimson Hexagon to forecast food prices in Indonesia by carrying out sentiment analysis of tweets. I had actually written this blog post on Crimson Hexagon four years ago to explore how the platform could be used for early warning purposes, so I’m thrilled to see this potential realized.

There is a lot that intrigues me about the work that HealthMap and Global Pulse are doing. But one point that really struck me vis-a-vis the former is just how little data was necessary to identify the outbreak. To be sure, not many Haitians are on Twitter and my impression is that most humanitarians have not really taken to Twitter either (I’m not sure about the Haitian Diaspora). This would suggest that accurate, early detection is possible even without Big Data; even with “Small Data” that is neither representative nor indeed verified. (Interestingly, Rumi notes that the Haiti dataset is actually larger than datasets typically used for this kind of study).

In related news, a recent peer-reviewed study by the European Commission found that the spatial distribution of crowdsourced text messages (SMS) following the earthquake in Haiti was strongly correlated with building damage. Again, the dataset of text messages was relatively small. And again, this data was neither collected using random sampling (i.e., it was crowdsourced) nor was it verified for accuracy. Yet the analysis of this small dataset still yielded some particularly interesting findings that have important implications for rapid damage detection in post-emergency contexts.

While I’m no expert in econometrics, what these studies suggest to me is that detecting change over time is ultimately more critical than having a large-N dataset, let alone one that is obtained via random sampling or even vetted for quality control purposes. That doesn’t mean that the latter factors are not important; it simply means that the outcome of the analysis is relatively less sensitive to these specific variables. Changes in the baseline volume/location of tweets on a given topic appear to be strongly correlated with offline dynamics.
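To make the idea of “change from baseline” concrete, here is a minimal sketch in Python. It is not the method used by HealthMap, Global Pulse or the European Commission study; it simply assumes we have daily counts of tweets on a topic and flags any day whose count sits well above the trailing average. The function name `flag_spikes`, the 7-day window, the threshold of 3 standard deviations and the counts themselves are all made up for illustration.

```python
from statistics import mean, stdev


def flag_spikes(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose count is far above the trailing baseline.

    A day is flagged when its count exceeds the mean of the previous
    `window` days by more than `threshold` sample standard deviations.
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline would cause a division by zero;
            # fall back to a unit spread (an arbitrary illustrative choice).
            sigma = 1.0
        z_score = (daily_counts[i] - mu) / sigma
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Made-up daily tweet counts: a quiet week followed by a sudden jump.
counts = [4, 5, 3, 6, 4, 5, 4, 5, 21, 30]
print(flag_spikes(counts))  # [8, 9]
```

Note that nothing here requires a large or representative dataset: the signal comes from comparing a small series against its own recent past, which is the point the studies above seem to support.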

What are the implications for crowdsourced crisis maps and disaster response? Could similar statistical analyses be carried out on Crowdmap data, for example? How small can a dataset be and still yield actionable findings like those mentioned in this blog post?

Some Thoughts on Real-Time Awareness for Tech@State

I’ve been invited to present at Tech@State in Washington DC to share some thoughts on the future of real-time awareness. So I thought I’d use my blog to brainstorm and invite feedback from iRevolution readers. The organizers of the event have shared the following questions with me as a way to guide the conversation: Where is all of this headed? What will social media look like in five to ten years and what will we do with all of the data? Knowing that the data stream can only increase in size, what can we do now to prepare and prevent being overwhelmed by the sheer volume of data?

These are big, open-ended questions, and I will only have 5 minutes to share some preliminary thoughts. I shall thus focus on how time-critical crowdsourcing can yield real-time awareness and expand from there.

Two years ago, my good friend and colleague Riley Crane won DARPA’s $40,000 Red Balloon Competition. His team at MIT found the location of 10 weather balloons hidden across the continental US in under 9 hours. The US covers more than 3.7 million square miles and the balloons were barely 8 feet wide. This was truly a needle-in-the-haystack kind of challenge. So how did they do it? They used crowdsourcing and leveraged social media—Twitter in particular—by using a “recursive incentive mechanism” to recruit thousands of volunteers to the cause. This mechanism would basically reward individual participants financially based on how important their contributions were to the location of one or more balloons. The result? Real-time, networked awareness.
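For readers wondering what a “recursive incentive mechanism” might look like in practice, here is a minimal sketch in Python. It assumes a simple halving rule: whoever finds a balloon receives a base reward, the person who recruited them receives half of that, the recruiter’s recruiter half again, and so on up the chain. The halving rule, the dollar amount, the names and the `payouts` function are my assumptions for illustration, not details given in this post.

```python
def payouts(finder, invited_by, finder_reward=2000.0):
    """Split a reward along a referral chain, halving it at each step.

    `invited_by` maps each participant to the person who recruited them.
    The finder gets `finder_reward`, their recruiter half of that, and so on.
    """
    rewards = {}
    seen = set()
    person, amount = finder, finder_reward
    # Walk up the chain until we reach someone nobody recruited.
    # The `seen` set guards against accidental cycles in the data.
    while person is not None and person not in seen:
        seen.add(person)
        rewards[person] = amount
        person = invited_by.get(person)
        amount /= 2
    return rewards


# Hypothetical chain: alice recruited bob, who recruited carol, who recruited dave.
invited_by = {"dave": "carol", "carol": "bob", "bob": "alice"}
print(payouts("dave", invited_by))
# {'dave': 2000.0, 'carol': 1000.0, 'bob': 500.0, 'alice': 250.0}
```

Because the amounts form a halving series, the total paid out per balloon can never reach twice the finder’s reward, however long the chain gets. That is what makes it affordable to reward recruiting as well as finding.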

Around the same time that Riley and his team celebrated their victory at MIT, another novel crowdsourcing initiative was taking place just a few miles away at The Fletcher School. Hundreds of students were busy combing through social and mainstream media channels for actionable and mappable information on Haiti following the devastating earthquake that had struck Port-au-Prince. This content was then mapped on the Ushahidi-Haiti Crisis Map, providing real-time situational awareness to first responders like the US Coast Guard and US Marine Corps. At the same time, hundreds of volunteers from the Haitian Diaspora were busy translating and geo-coding tens of thousands of text messages from disaster-affected communities in Haiti who were texting in their location & most urgent needs to a dedicated SMS short code. Fletcher School students filtered and mapped the most urgent and actionable of these text messages as well.

One year after Haiti, the United Nations Office for the Coordination of Humanitarian Affairs (OCHA) asked the Standby Volunteer Task Force (SBTF), a global network of 700+ volunteers, for a real-time map of crowdsourced social media information on Libya in order to improve their own situational awareness. Thus was born the Libya Crisis Map.

The result? The Head of OCHA’s Information Services Section at the time sent an email to SBTF volunteers to commend them for their novel efforts. In this email, he wrote:

“Your efforts at tackling a difficult problem have definitely reduced the information overload; sorting through the multitude of signals on the crisis is no easy task. The Task Force has given us an output that is manageable and digestible, which in turn contributes to better situational awareness and decision making.”

These three examples from the US, Haiti and Libya demonstrate what is already possible with time-critical crowdsourcing and social media. So where is all this headed? You may have noted from each of these examples that their success relied on the individual actions of hundreds and sometimes thousands of volunteers. This is primarily because automated solutions to filter and curate the data stream are not yet available (or rather accessible) to the wider public. Indeed, these solutions tend to be proprietary, expensive and/or classified. I thus expect to see free and open source solutions crop up in the near future; solutions that will radically democratize the tools needed to gain shared, real-time awareness.

But automated natural language processing (NLP) and machine learning alone are not likely to succeed, in my opinion. The data stream is actually not a stream; it is a massive torrent of non-indexed information, a 24-hour global firehose of real-time, distributed multi-media data that continues to outpace our ability to produce actionable intelligence from this torrential downpour of 0’s and 1’s. To turn this data tsunami into real-time shared awareness will require that our filtering and curation platforms become more automated and collaborative. I believe the key is thus to combine automated solutions with real-time collaborative crowdsourcing tools—that is, platforms that enable crowds to collaboratively filter and curate real-time information, in real-time.

Right now, when we comb through Twitter, for example, we do so on our own, sitting behind our laptop, isolated from others who may be seeking to filter the exact same type of content. We need to develop free and open source platforms that allow for the distributed-but-networked, crowdsourced filtering and curation of information in order to democratize the sense-making of the firehose. Only then will the wider public be able to win the equivalent of Red Balloon competitions without needing $40,000 or a degree from MIT.

I’d love to get feedback from readers about what other compelling cases or arguments I should bring up in my presentation tomorrow. So feel free to post some suggestions in the comments section below. Thank you!

Seeking the Trustworthy Tweet: Can “Tweetsourcing” Ever Fit the Needs of Humanitarian Organizations?

Can microblogged data fit the information needs of humanitarian organizations? This is the question asked by a group of academics at Pennsylvania State University’s College of Information Sciences and Technology. Their study (PDF) is an important contribution to the discourse on humanitarian technology and crisis information. The applied research provides key insights based on a series of interviews with humanitarian professionals. While I largely agree with the majority of the arguments presented in this study, I do have questions regarding the framing of the problem and some of the assertions made.

The authors note that “despite the evidence of strong value to those experiencing the disaster and those seeking information concerning the disaster, there has been very little uptake of message data by large-scale, international humanitarian relief organizations.” This is because real-time message data is “deemed as unverifiable and untrustworthy, and it has not been incorporated into established mechanisms for organizational decision-making.” To this end, “committing to the mobilization of valuable and time sensitive relief supplies and personnel, based on what may turn out be illegitimate claims, has been perceived to be too great a risk.” Thus far, the authors argue, “no mechanisms have been fashioned for harvesting microblogged data from the public in a manner, which facilitates organizational decisions.”

I don’t think this latter assertion is entirely true if one looks at the use of Twitter by the private sector. Take for example the services offered by Crimson Hexagon, which I blogged about 3 years ago. This successful start-up launched by Gary King out of Harvard University provides companies with real-time sentiment analysis of brand perceptions in the Twittersphere precisely to help inform their decision making. Another example is Storyful, which harvests data from authenticated Twitter users to provide highly curated, real-time information via microblogging. Given that the humanitarian community lags behind in the use and adoption of new technologies, it behooves us to look at those sectors that are ahead of the curve to better understand the opportunities that do exist.

Since the study principally focused on Twitter, I’m surprised that the authors did not reference the empirical study that came out last year on the behavior of Twitter users after the 8.8 magnitude earthquake in Chile. The study shows that about 95% of tweets related to confirmed reports validated that information. In contrast only 0.03% of tweets denied the validity of these true cases. Interestingly, the results also show that “the number of tweets that deny information becomes much larger when the information corresponds to a false rumor.” In fact, about 50% of tweets will deny the validity of false reports. This means it may very well be possible to detect rumors by using aggregate analysis on tweets.

On framing, I believe the focus on microblogging and Twitter in particular misses the bigger picture which ultimately is about the methodology of crowdsourcing rather than the technology. To be sure, the study by Penn State could just as well have been titled “Seeking the Trustworthy SMS.” I think this important research on microblogging would be stronger if this distinction were made and the resulting analysis tied more closely to the ongoing debate on crowdsourcing crisis information that began during the response to Haiti’s earthquake in 2010.

Also, as was noted during the Red Cross Summit in 2010, more than two-thirds of respondents to a survey noted that they would expect a response within an hour if they posted a need for help on a social media platform (and not just Twitter) during a crisis. So whether humanitarian organizations like it or not, crowdsourced social media information cannot be ignored.

The authors carried out a series of insightful interviews with about a dozen international humanitarian organizations to try and better understand the hesitation around the use of Twitter for humanitarian response. As noted earlier, however, it is not Twitter per se that is a concern but the underlying methodology of crowdsourcing.

As expected, interviewees noted that they prioritize the veracity of information over the speed of communication. “I don’t think speed is necessarily the number one tool that an emergency operator needs to use.” Another interviewee opined that “It might be hard to trust the data. I mean, I don’t think you can make major decisions based on a couple of tweets, on one or two tweets.” What’s interesting about this latter comment is that it implies that only one channel of information, Twitter, is to be used in decision-making, which is a false argument and one that nobody I know has ever made.

Either way, the trade-off between speed and accuracy is a well known one. As mentioned in this blog post from 2009, information is perishable and accuracy is often a luxury in the first few hours and days following a major disaster. As the authors of the study rightly note, “uncertainty is ‘always expected, if sometimes crippling’ (Benini, 1997) for NGOs involved in humanitarian relief.” Ultimately, the question posed by the authors of the Penn State study can be boiled down to this: is some information better than no information if it cannot be immediately verified? In my opinion, yes. If you have some information, then at least you can investigate its veracity, which may lead to action. I also believe that from a philosophical point of view, the answer would still be yes.

Based on the interviews, the authors found that organizations engaged in immediate emergency response were less likely to make use of Twitter (or crowdsourced information) as a channel for information. As one interviewee put it, “Lives are on the line. Every moment counts. We have it down to a science. We know what information we need and we get in and get it…” In contrast, those organizations engaged in subsequent phases of disaster response were thought more likely to make use of crowdsourced data.

I’m not entirely convinced by this: “We know what information we need and we get in and get it…”. Yes, humanitarian organizations typically know what information they need, but whether they get it, and in time, is certainly not a given. Just look at the humanitarian responses to Haiti and Libya, for example. Organizations may very well be “unwilling to trade data assurance, veracity and authenticity for speed,” but sometimes this mindset will mean having absolutely no information. This is why OCHA asked the Standby Volunteer Task Force to provide them with a live crowdsourced social media map of Libya. In Haiti, while the UN is not thought to have used crowdsourced SMS data from Mission 4636, other responders like the Marine Corps did.

Still, according to one interviewee, “fast is good, but bad information fast can kill people. It’s got to be good, and maybe fast too.” This assumes that having no information doesn’t kill people. Good information that arrives late can also kill people. As one of the interviewees admitted when using traditional methods, “it can be quite slow before all that [information] trickles through all the layers to get to us.” The authors of the study also noted that, “Many [interviewees] were frustrated with how slow the traditional methods of gathering post-disaster data had remained despite the growing ubiquity of smart phones and high quality connectivity and power worldwide.”

On a side note, I found the following comment during the interviews especially revealing: “When we do needs assessments, we drive around and we look with our eyes and we talk to people and we assess what’s on the ground and that’s how we make our evaluations.” One of the common criticisms leveled against the use of crowdsourced information is that it isn’t representative. But then again, driving around, checking things out and chatting with people is hardly going to yield a representative sample either.

One of the main findings from this research has to do with a problem in attitude on the part of humanitarian organizations. “Each of the interviewees stated that their organization did not have the organizational will to try out new technologies. Most expressed this as a lack of resources, support, leadership and interest to adopt new technologies.” As one interviewee noted, “We tried to get the president and CEO both to use Twitter. We failed abysmally, so they’re not– they almost never use it.” Interestingly, “most of the respondents admitted that many of their technological changes were motivated by the demands of their donors. At this point in time their donors have not demanded that these organizations make use of microblogged data. The subjects believed they would need to wait until this occurred for real change to begin.”

For me the lack of will has less to do with available resources and limited capacity and far more to do with a generational gap. When today’s young professionals in the humanitarian space work their way up to more executive positions, we’ll see a significant change in attitude within these organizations. I’m thinking in particular of the many dozens of core volunteers who played a pivotal role in the crisis mapping operations in Haiti, Chile, Pakistan, Russia and most recently Libya. And when attitude changes, resources can be reallocated and new priorities can be rationalized.

What’s interesting about these interviews is that despite all the concerns and criticisms of crowdsourced Twitter data, all interviewees still see microblogged data as a “vast trove of potentially useful information concerning a disaster zone.” One of the professionals interviewed said, “Yes! Yes! Because that would – again, it would tell us what resources are already in the ground, what resources are still needed, who has the right staff, what we could provide. I mean, it would just – it would give you so much more real-time data, so that as we’re putting our plans together we can react based on what is already known as opposed to getting there and discovering, oh, they don’t really need medical supplies. What they really need is construction supplies or whatever.”

Another professional stated that, “Twitter data could potentially be used the same way… for crisis mapping. When an emergency happens there are so many things going on in the ground, and an emergency response is simply prioritization, taking care of the most important things first and knowing what those are. The difficult thing is that things change so quickly. So being able to gather information quickly…. <with Twitter> There’s enormous power.”

The authors propose three possible future directions. The first is bounded microblogging, which I have long referred to as “bounded crowdsourcing.” It doesn’t make sense to focus on the technology instead of the methodology because at the heart of the issue are the methods for information collection. In “bounded crowdsourcing,” membership is “controlled to only those vetted by a particular organization or community.” This is the approach taken by Storyful, for example. One interviewee acknowledged that “Twitter might be useful right after a disaster, but only if the person doing the Tweeting was from <NGO name removed>, you know, our own people. I guess if our own people were sending us back Tweets about the situation it could help.”

Bounded crowdsourcing overcomes the challenge of authentication and verification but obviously with a tradeoff in the volume of data collected “if an additional means were not created to enable new members through an automatic authentication system, to the bounded microblogging community.” However, the authors feel that bounded crowdsourcing environments “undermine the value of the system” since “the power of the medium lies in the fact that people, out of their own volition, make localized observations and that organizations could harness that multitude of data. The bounded environment argument neutralizes that, so in effect, at that point, when you have a group of people vetted to join a trusted circle, the data does not scale, because that pool by necessity would be small.”

That said, I believe the authors are spot on when they write that “Bounded environments might be a way of introducing Twitter into the humanitarian centric organizational discourse, as a starting point, because these organizations, as seen from the evidence presented above, are not likely to initially embrace the medium. Bounded environments could hence demonstrate the potential for Twitter to move beyond the PR and Communications departments.”

The second possible future direction is to treat crowdsourced data as ambient, “contextual information rather than instrumental information, (i.e., factual in nature).” This grassroots information could be considered as an “add-on to traditional, trusted institutional lines of information gathering.” As one interviewee noted, “Usually information exists. The question is the context doesn’t exist…. that’s really what I see as the biggest value [of crowdsourced information] and why would you use that in the future is creating the context…”.

The authors rightly suggest that “adding contextual information through microblogged data may alleviate some of the uncertainty during the time of disaster. Since the microblogged data would not be the single data source upon which decisions would be made, the standards for authentication and security could be less stringent. This solution would offer the organization rich contextual data, while reducing the need for absolute data authentication, reducing the need for the organization to structurally change, and reducing the need for significant resources.” This is exactly how I consider and treat crowdsourced data.

The third and final forward-looking solution is computational. The authors “believe better computational models will eventually deduce informational snippets with acceptable levels of trust.” They refer to Ushahidi’s SwiftRiver project as an example.

In sum, this study is an important contribution to the discourse. The challenges around using crowdsourced crisis information are well known. If I come across as optimistic, it is for two reasons. First, I do think a lot can be done to address the challenges. Second, I do believe that attitudes in the humanitarian sector will continue to change.

Analyzing the Veracity of Tweets during a Major Crisis

A research team at Yahoo recently completed an empirical study (PDF) on the behavior of Twitter users after the 8.8 magnitude earthquake in Chile. The study was based on 4,727,524 indexed tweets, about 20% of which were replies to other tweets. What is particularly interesting about this study is that the team also analyzed the spread of false rumors and confirmed news that were disseminated on Twitter.

The authors “manually selected some relevant cases of valid news items, which were confirmed at some point by reliable sources.” In addition, they “manually selected important cases of baseless rumors which emerged during the crisis (confirmed to be false at some point).” Their goal was to determine whether users interacted differently when faced with valid news vs false rumors.

The study shows that about 95% of tweets related to confirmed reports validated that information. In contrast only 0.03% of tweets denied the validity of these true cases. Interestingly, the results also show that “the number of tweets that deny information becomes much larger when the information corresponds to a false rumor.” In fact, about 50% of tweets will deny the validity of false reports. The table below lists the full results.

The authors conclude that “the propagation of tweets that correspond to rumors differs from tweets that spread news because rumors tend to be questioned more than news by the Twitter community. Notice that this fact suggests that the Twitter community works like a collaborative filter of information. This result suggests also a very promising research line: it could posible to detect rumors by using aggregate analysis on tweets.”
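To give a sense of what “aggregate analysis” could mean here, the following Python sketch flags a story as a likely rumor when a large share of the tweets about it deny it. This is not the Yahoo team’s method: it assumes each tweet has already been labeled as affirming or denying the story (the hard part, which the authors did by hand), and the 25% cut-off, the minimum of 20 tweets and the function names are arbitrary choices of mine, picked only because they sit between the very low denial rate reported for confirmed news and the roughly 50% reported for false rumors.

```python
def denial_share(affirms, denies):
    """Fraction of labeled tweets about a story that deny it."""
    total = affirms + denies
    return denies / total if total else 0.0


def likely_rumor(affirms, denies, threshold=0.25, min_tweets=20):
    """Flag a story as a likely rumor based on how often it is questioned.

    Returns None when there are too few labeled tweets to say anything,
    otherwise True if the denial share reaches `threshold`.
    """
    if affirms + denies < min_tweets:
        return None
    return denial_share(affirms, denies) >= threshold


# Made-up counts for two hypothetical stories.
print(likely_rumor(affirms=950, denies=3))   # False: almost nobody disputes it
print(likely_rumor(affirms=40, denies=45))   # True: about half of the tweets deny it
print(likely_rumor(affirms=2, denies=1))     # None: not enough tweets to judge
```

A real system would also need to handle tweets that merely ask questions or are unrelated, which this sketch ignores.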

I think these findings are particularly important for projects like Swift River, which try to validate crowdsourced crisis information in real-time. I would also be interested to see a similar study on tweets around the Haitian earthquake to explore whether this “collaborative filter” dynamic is an emergent phenomenon in this complex system or simply an artifact of something else.

Interested in learning more about “information forensics”? See this link and the articles below: