Category Archives: Big Data

CrisisTracker: Collaborative Social Media Analysis For Disaster Response

I just had the pleasure of speaking with my new colleague Jakob Rogstadius from the Madeira Interactive Technologies Institute (Madeira-ITI). Jakob is working on CrisisTracker, a very interesting platform designed to facilitate collaborative social media analysis for disaster response. The rationale for CrisisTracker is the same one behind Ushahidi’s SwiftRiver project, and the platform could be hugely helpful for crisis mapping projects carried out by the Standby Volunteer Task Force (SBTF).

From the CrisisTracker website:

“During large-scale complex crises such as the Haiti earthquake, the Indian Ocean tsunami and the Arab Spring, social media has emerged as a source of timely and detailed reports regarding important events. However, individual disaster responders, government officials or citizens who wish to access this vast knowledge base are met with a torrent of information that quickly results in information overload. Without a way to organize and navigate the reports, important details are easily overlooked and it is challenging to use the data to get an overview of the situation as a whole.”

“We (Madeira University, University of Oulu and IBM Research) believe that volunteers around the world would be willing to assist hard-pressed decision makers with information management, if the tools were available. With this vision in mind, we have developed CrisisTracker.”

Like SwiftRiver, CrisisTracker combines some automated clustering of content with the crowdsourced curation of said content for further filtering. “Any user of the system can directly contribute tags that make it easier for other users to retrieve information and explore stories by similarity. In addition, users of the system can influence how tweets are grouped into stories.” Stories can be filtered by Report Category, Keywords, Named Entities, Time and Location. CrisisTracker also allows for simple geo-fencing to capture and list only those Tweets displayed on a given map.

Geolocation, Report Categories and Named Entities are all generated manually. The clustering of reports into stories is done automatically using keyword frequencies. So if keyword dictionaries exist for other languages, the platform could be used in these other languages as well. The result is a list of clustered Tweets displayed below the map, with the most popular cluster at the top.
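
To make that clustering step a bit more concrete, here is a minimal sketch of keyword-frequency clustering of tweets into stories. This is an illustration of the general technique only, not CrisisTracker’s actual code; the sample tweets, stopword list and similarity threshold are all made up.

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"a", "an", "the", "in", "of", "at", "and", "on"}

def keyword_vector(text):
    """Bag-of-words term frequencies, ignoring a few stopwords."""
    tokens = [t.lower().strip("#@.,!?") for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_tweets(tweets, threshold=0.5):
    """Greedy single-pass clustering: attach each tweet to the most
    similar existing story, or start a new story."""
    stories = []  # each story is a list of (tweet, vector) pairs
    for tweet in tweets:
        vec = keyword_vector(tweet)
        best_story, best_sim = None, 0.0
        for story in stories:
            sim = max(cosine(vec, v) for _, v in story)
            if sim > best_sim:
                best_story, best_sim = story, sim
        if best_story is not None and best_sim >= threshold:
            best_story.append((tweet, vec))
        else:
            stories.append([(tweet, vec)])
    return stories

# Invented sample tweets, for illustration only.
sample = [
    "Explosion reported near the market in Damascus",
    "Large explosion heard in Damascus market area",
    "Flooding closes the main bridge north of town",
]
for i, story in enumerate(cluster_tweets(sample), start=1):
    print(f"Story {i}:", [tweet for tweet, _ in story])
```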

Clicking on an entry like the row in red above opens up a new page, like the one below. This page lists a group of tweets that all discuss the same specific event, in this case an explosion in Syria’s capital.

What is particularly helpful about this setup is the meta-data displayed for this story or event: the number of people who tweeted about the story, the number of tweets about the story, and the first day/time the story was shared on Twitter. In addition, the first tweet to report the story is listed as well, which is very helpful. This list can be ranked according to “Size,” a figure that reflects the minimum number of original tweets and the number of Twitter users who shared these tweets. This is a particularly useful metric (and way to deal with spammers). Users also have the option of listing the first 50 tweets that referenced the story.

As you may be able to tell from the “Hide Story” and “Remove” buttons on the righthand-side of the display above, each clustered story and indeed each tweet can be hidden or removed if not relevant. This is where crowdsourced curation comes in. In addition, CrisisTracker enables users to geo-tag and categorize each tweet according to report type (e.g., Violence, Deaths, Request/Need, etc.), general keywords (e.g., #assad, #blasts, etc.) and named entities. Note that the keywords can be removed and more high-quality tags can be added or crowdsourced by users as well (see below).

CrisisTracker also suggests related stories that may be of interest to the user based on the initial clustering and filtering—assisted manual clustering. In addition, the platform’s API means that the data can then be exported in XML using a simple parser. So interoperability with platforms like Ushahidi’s would be possible. After our call, Jakob added a link on each story page in the system (a small XML icon below the related stories) to get the story in XML format. Any other system can now take this URL and parse the story into its own native format. Jakob is also looking to build a number of extensions to CrisisTracker and a “Share with Ushahidi” button may be one such future extension. CrisisTracker is basically Jakob’s core PhD project, which is very cool, so he’ll be working on this for at least one more year.
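
Since interoperability hinges on that per-story XML link, here is a rough sketch of what pulling and parsing a story could look like. The URL and element names below are placeholders I invented for illustration; the real schema would need to be checked against CrisisTracker’s actual XML output.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL: in CrisisTracker the per-story XML link sits behind
# the small XML icon on each story page.
STORY_URL = "http://example.org/crisistracker/story/1234.xml"

def fetch_story(url):
    """Download and parse one story's XML feed."""
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

def story_to_report(root):
    """Flatten a story into a dict another platform (say, an Ushahidi
    deployment) could ingest. The element names are assumptions, not
    CrisisTracker's documented schema."""
    return {
        "title": root.findtext("title", default=""),
        "first_seen": root.findtext("first_tweet_time", default=""),
        "tweet_count": root.findtext("tweet_count", default="0"),
        "latitude": root.findtext("location/lat", default=""),
        "longitude": root.findtext("location/lon", default=""),
        "tags": [tag.text for tag in root.findall("tags/tag")],
    }

if __name__ == "__main__":
    print(story_to_report(fetch_story(STORY_URL)))
```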

In sum, this could very well be the platform that many of us in the crisis mapping space have been waiting for. As I wrote in February 2012, turning the Twitter-sphere “into real-time shared awareness will require that our filtering and curation platforms become more automated and collaborative. I believe the key is thus to combine automated solutions with real-time collaborative crowdsourcing tools—that is, platforms that enable crowds to collaboratively filter and curate real-time information, in real-time. Right now, when we comb through Twitter, for example, we do so on our own, sitting behind our laptop, isolated from others who may be seeking to filter the exact same type of content. We need to develop free and open source platforms that allow for the distributed-but-networked, crowdsourced filtering and curation of information in order to democratize the sense-making of the firehose.”

Actually, I’ve been advocating for this approach since early 2009. So I’m really excited about Jakob’s project. We’ll be partnering with him and the Standby Volunteer Task Force (SBTF) in September 2012 to test the platform and provide him with expert feedback on how to further streamline the tool for collaborative social media analysis and crisis mapping. Jakob is also looking for domain experts to help on this study. I’ve also invited Jakob to present CrisisTracker at the 2012 CrisisMappers Conference in Washington DC and very much hope he can join us to demo the tool in person. In the meantime, the video above provides an excellent overview of CrisisTracker, as does the project website. Finally, the project is also open source and available on GitHub here.

Epilogue: The main problem with CrisisTracker is that it is still too manual; it does not include any machine learning or artificial intelligence features; and it has only been applied to Syria. This may explain why it has not gained traction in the humanitarian space so far.

Towards a Twitter Dashboard for the Humanitarian Cluster System

One of the principal Research and Development (R&D) projects I’m spearheading with colleagues at the Qatar Computing Research Institute (QCRI) has been getting a great response from several key contacts at the UN’s Office for the Coordination of Humanitarian Affairs (OCHA). In fact, their input has been instrumental in laying the foundations for our early R&D efforts. I therefore highlighted the initiative during my recent talk at the UN’s ECOSOC panel in New York, which was moderated by OCHA Under-Secretary General Valerie Amos. The response there was also very positive. So what’s the idea? To develop the foundations for a Twitter Dashboard for the Humanitarian Cluster System.

The purpose of the Twitter Dashboard for Humanitarian Clusters is to extract relevant information from Twitter and aggregate this information by Cluster for analytical purposes. As the above graphic shows, clusters focus on core humanitarian issues including Protection, Shelter, Education, etc. Our plan is to go beyond standard keyword search and simple Natural Language Processing (NLP) approaches to more advanced Machine Learning (ML) techniques and social computing methods. We’ve spent the past month asking various contacts whether anyone has developed such a dashboard but thus far have not come across any pre-existing efforts. We’ve also spent this time getting input from key colleagues at OCHA to ensure that what we’re developing will be useful to them.
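
To give a feel for the direction we have in mind, the sketch below trains a toy multi-class text classifier that routes tweets to clusters. It stands in for the far more advanced ML techniques mentioned above; the cluster labels and example tweets are invented, and a real system would require a large annotated corpus per cluster.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, one invented example per cluster.
tweets = [
    "families sleeping outside, urgent need for tents and tarpaulins",
    "school buildings damaged, classes suspended across the district",
    "reports of attacks on displaced women near the camp at night",
    "water points contaminated, cholera cases rising in the south",
]
clusters = ["Shelter", "Education", "Protection", "Health"]

# TF-IDF over word unigrams/bigrams feeding a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, clusters)

print(model.predict(["no tents left, people sheltering under plastic sheets"]))
```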

It is important to emphasize that the project is purely experimental for now. This is one of the big advantages of being part of an institute for advanced computing R&D; we get to experiment and carry out applied research on next-generation humanitarian technology solutions. We realize full well the many challenges and limitations of using Twitter as an information source, so I won’t repeat them here. The point is not to suggest that a would-be Twitter Dashboard should be used instead of existing information management platforms. As United Nations colleagues themselves have noted, such a dashboard would simply be another dial on their own dashboards, which may at times prove useful, especially when compared or integrated with other sources of information.

Furthermore, if we’re serious about communicating with disaster-affected communities, and the latter at times share crisis information on Twitter, then we may want to listen to what they are saying. This includes Diasporas as well. The point, quite simply, is to make full use of Twitter by at least extracting all relevant and meaningful information that contributes to situational awareness. The plan, therefore, is to have the Twitter Dashboard for Humanitarian Clusters aggregate information relevant to each specific cluster and to then provide key analytics for this content in order to reveal potentially interesting trends and outliers within each cluster.
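
And to illustrate the analytics piece, here is a minimal pandas sketch that aggregates already-classified tweets per cluster per day and flags days that deviate sharply from a cluster’s own baseline. The data and the two-standard-deviation rule are placeholders for whatever outlier detection we eventually settle on.

```python
import pandas as pd

# Hypothetical stream of tweets that have already been assigned to clusters.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-07-01", "2012-07-01", "2012-07-02", "2012-07-02",
        "2012-07-02", "2012-07-03", "2012-07-03", "2012-07-03",
    ]),
    "cluster": ["Shelter", "Health", "Shelter", "Shelter",
                "Health", "Shelter", "Shelter", "Shelter"],
})

# Daily tweet volume per cluster.
daily = (df.groupby([pd.Grouper(key="time", freq="D"), "cluster"])
           .size().rename("tweets").reset_index())

# Flag days where a cluster exceeds its own mean by two standard
# deviations -- a crude stand-in for proper outlier detection.
stats = daily.groupby("cluster")["tweets"].agg(["mean", "std"]).fillna(0.0)
daily = daily.join(stats, on="cluster")
daily["outlier"] = daily["tweets"] > daily["mean"] + 2 * daily["std"]
print(daily)
```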

Depending on how the R&D goes, we envision adding “credibility computing” to the Dashboard and expect to collaborate with our Arabic Language Technology Center to add Arabic tweets as well. Other languages could also be added in the future depending on initial results. Also, while we’re presently referring to this platform as a “Twitter” Dashboard, adding SMS, RSS feeds, etc., could be part of a subsequent phase. The focus would remain specifically on the Humanitarian Cluster system and the clusters’ underlying minimum essential indicators for decision-making.

The software and crisis ontologies we are developing as part of these R&D efforts will all be open source. Hopefully, we’ll have some initial results worth sharing by the time the International Conference of Crisis Mappers (ICCM 2012) rolls around in mid-October. In the meantime, we continue collaborating with OCHA and other colleagues and as always welcome any constructive feedback from iRevolution readers.

Truth in the Age of Social Media: A Social Computing and Big Data Challenge

I have been writing and blogging about “information forensics” for a while now and thus relished Nieman Report’s must-read study on “Truth in the Age of Social Media.” My applied research has specifically been on the use of social media to support humanitarian crisis response (see the multiple links at the end of this blog post). More specifically, my focus has been on crowdsourcing and automating ways to quantify veracity in the social media space. One of the Research & Development projects I am spearheading at the Qatar Computing Research Institute (QCRI) specifically focuses on this hybrid approach. I plan to blog about this research in the near future but for now wanted to share some of the gems in this superb 72-page Nieman Report.

In the opening piece of the report, Craig Silverman writes that “never before in the history of journalism—or society—have more people and organizations been engaged in fact checking and verification. Never has it been so easy to expose an error, check a fact, crowdsource and bring technology to bear in service of verification.” While social media is new, traditional journalistic skills and values are still highly relevant to verification challenges in the social media space. In fact, some argue that “the business of verifying and debunking content from the public relies far more on journalistic hunches than snazzy technology.”

I disagree. This is not an either/or challenge. Social computing can help everyone, not just journalists, develop and test hunches. Indeed, it is imperative that these tools be in the reach of the general public since a “public with the ability to spot a hoax website, verify a tweet, detect a faked photo, and evaluate sources of information is a more informed public. A public more resistant to untruths and so-called rumor bombs.” This public resistance to untruths can itself be monitored and modeled to quantify veracity, as this study shows.

David Turner from the BBC writes that “while some call this new specialization in journalism ‘information forensics,’ one does not need to be an IT expert or have special equipment to ask and answer the fundamental questions used to judge whether a scene is staged or not.” No doubt, but as Craig rightly points out, “the complexity of verifying content from myriad sources in various mediums and in real time is one of the great new challenges for the profession.” This is fundamentally a Social Computing, Crowd Computing and Big Data problem. Rumors and falsehoods are treated as bugs or patterns of interference rather than as features. The key here is to operate at the aggregate level for statistical purposes and to move beyond the notion of true/false as a dichotomy and towards probabilities (think statistical physics). Clustering social media across different media and cross-triangulation using statistical models is one area I find particularly promising.
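
One way to make the “probabilities rather than a verdict” point concrete is a simple log-odds combination of weak credibility signals, in the spirit of naive Bayes. This is a toy sketch of the general idea rather than an actual verification model; the prior and the likelihood ratios below are invented.

```python
from math import exp, log

def probability_of_veracity(prior, likelihood_ratios):
    """Combine independent evidence about a report in log-odds space.
    Each likelihood ratio > 1 supports the report, < 1 undermines it.
    Returns a probability, not a true/false verdict."""
    log_odds = log(prior / (1 - prior)) + sum(log(lr) for lr in likelihood_ratios)
    odds = exp(log_odds)
    return odds / (1 + odds)

# Invented likelihood ratios, purely for illustration:
signals = [
    3.0,  # several apparently unconnected accounts report the same event
    2.0,  # the earliest account has a long pre-crisis posting history
    0.5,  # no corroborating photo or video has surfaced yet
]
print(round(probability_of_veracity(prior=0.2, likelihood_ratios=signals), 2))
```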

Furthermore, the fundamental questions used to judge whether or not a scene is staged can be codified. “Old values and skills are still at the core of the discipline.” Indeed, and heuristics based on decades of rich experience in the field of journalism can be coded into social computing algorithms and big data analytics platforms. This doesn’t mean that a fully automated solution should be the goal. The hunch of the expert when combined with the wisdom of the crowd and advanced social computing techniques is far more likely to be effective. As CNN’s Lila King writes, technology may not always be able to “prove if a story is reliable but offers helpful clues.” The quicker we can find those clues, the better.

It is true, as Craig notes, that repressive regimes “create fake videos and images and upload them to YouTube and other websites in the hope that news organizations and the public will find them and take them for real.” It is also true that civil society actors can debunk these falsifications, as I’ve often noted in my research. While the report focuses on social media, we must not forget that offline follow up and investigation is often an option. During the 2010 Egyptian Parliamentary Elections, civil society groups were able to verify 91% of crowdsourced information in near real time thanks to hyper-local follow up and phone calls. (Incidentally, they worked with a seasoned journalist from Thomson Reuters to design their verification strategies.) A similar verification strategy was employed vis-a-vis the atrocities committed in Kyrgyzstan two years ago.

In his chapter on “Detecting Truth in Photos,” Santiago Lyon from the Associated Press (AP) describes the mounting challenges of identifying false or doctored images. “Like other news organizations, we try to verify as best we can that the images portray what they claim to portray. We look for elements that can support authenticity: Does the weather report say that it was sunny at the location that day? Do the shadows fall the right way considering the source of light? Is clothing consistent with what people wear in that region? If we cannot communicate with the videographer or photographer, we will add a disclaimer that says the AP ‘is unable to independently verify the authenticity, content, location or date of this handout photo/video.’”

Santiago and his colleagues are also exploring more automated solutions and believe that “manipulation-detection software will become more sophisticated and useful in the future. This technology, along with robust training and clear guidelines about what is acceptable, will enable media organizations to hold the line against willful image manipulation, thus maintaining their credibility and reputation as purveyors of the truth.”

David Turner’s piece on the BBC’s User-Generated Content (UGC) Hub is also full of gems. “The golden rule, say Hub veterans, is to get on the phone with whoever has posted the material. Even the process of setting up the conversation can speak volumes about the source’s credibility: unless sources are activists living in a dictatorship who must remain anonymous.” This was one of the strategies used by Egyptians during the 2010 Parliamentary Elections. Interestingly, many of the anecdotes that David and Santiago share involve members of the “crowd” letting them know that certain information they’ve posted is in fact wrong. Technology could facilitate this process by distributing the challenge of collective debunking in a far more agile and rapid way using machine learning.

This may explain why David expects the field of “information forensics” to become industrialized. “By that, he means that some procedures are likely to be carried out simultaneously at the click of an icon. He also expects that technological improvements will make the automated checking of photos more effective. Useful online tools for this are Google’s advanced picture search or TinEye, which look for images similar to the photo copied into the search function.” In addition, the BBC’s UGC Hub uses Google Earth to “confirm that the features of the alleged location match the photo.” But these new technologies should not and won’t be limited to verifying content in only one medium but rather across media. Multi-media verification is the way to go.
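
As a small illustration of what automated photo checking can look like, the sketch below uses perceptual hashing (via the Pillow and imagehash libraries) to flag a “new” photo that closely matches older archive images, i.e., possibly recycled footage. The file names and distance threshold are hypothetical.

```python
from PIL import Image      # pip install Pillow
import imagehash           # pip install ImageHash

def near_duplicates(candidate_path, archive_paths, max_distance=8):
    """Return archive images whose perceptual hash is close to the
    candidate's -- a hint that a 'new' photo may be recycled footage."""
    candidate = imagehash.average_hash(Image.open(candidate_path))
    matches = []
    for path in archive_paths:
        distance = candidate - imagehash.average_hash(Image.open(path))  # Hamming distance
        if distance <= max_distance:
            matches.append((path, distance))
    return sorted(matches, key=lambda match: match[1])

# Hypothetical file names, for illustration only.
print(near_duplicates("uploaded_photo.jpg",
                      ["archive/2009_flood.jpg", "archive/2010_quake.jpg"]))
```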

Journalists like David Turner often (and rightly) note that “being right is more important than being first.” But in humanitarian crises, information is the most perishable of commodities, and being last vis-a-vis information sharing can actually do harm. Indeed, bad information can have far-reaching negative consequences, but so can no information. This tradeoff must be weighed carefully in the context of verifying crowdsourced crisis information.

Mark Little’s chapter on “Finding the Wisdom in the Crowd” describes the approach that Storyful takes to verification. “At Storyful, we think a combination of automation and human skills provides the broadest solution.” Amen. Mark and his team use the phrase “human algorithm” to describe their approach (I use the term Crowd Computing). In an age when every news event creates a community, “authority has been replaced by authenticity as the currency of social journalism.” Many of Storyful’s tactics for vetting authenticity are the same ones we use in crisis mapping when we seek to validate crowdsourced crisis information. These combine the common sense of an investigative journalist with advanced digital literacy.

In her chapter, “Taking on the Rumor Mill,” Katherine Lee writes that a “disaster is ready-made for social media tools, which provide the immediacy needed for reporting breaking news.” She describes the use of these tools during and after the tornado that hit Alabama in April 2011. What I found particularly interesting was her news team’s decision to “blog to probe some of the more persistent rumors, tracking where they might have originated and talking with officials to get the facts. The format fit the nature of the story well. Tracking the rumors, with their ever-changing details, in print would have been slow and awkward, and the blog allowed us to update quickly.” In addition, the blog format “gave readers a space to weigh in with their own evidence, which proved very useful.”

The remaining chapters in the Nieman Report are equally interesting but do not focus on “information forensics” per se. I look forward to sharing more on QCRI’s project on quantifying veracity in the near future as our objective is to learn from experts such as those cited above and codify their experience so we can leverage the latest breakthroughs in social computing and big data analytics to facilitate the verification and validation of crowdsourced social media content. It is worth emphasizing that these codified heuristics cannot and must not remain static, nor can the underlying algorithms become hardwired. More on this in a future post. In the meantime, the following links may be of interest:

  • Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media (Link)
  • How to Verify and Counter Rumors in Social Media (Link)
  • Data Mining to Verify Crowdsourced Information in Syria (Link)
  • Analyzing the Veracity of Tweets During a Crisis (Link)
  • Crowdsourcing for Human Rights: Challenges and Opportunities for Information Collection & Verification (Link)
  • Truthiness as Probability: Moving Beyond the True or False Dichotomy when Verifying Social Media (Link)
  • The Crowdsourcing Detective: Crisis, Deception and Intrigue in the Twittersphere (Link)
  • Crowdsourcing Versus Putin (Link)
  • Wiki on Truthiness resources (Link)
  • My TEDx Talk: From Photosynth to ALLsynth (Link)
  • Social Media and Life Cycle of Rumors during Crises (Link)
  • Wag the Dog, or How Falsifying Crowdsourced Data Can Be a Pain (Link)

Crisis Tweets: Natural Language Processing to the Rescue?

My colleagues at the University of Colorado, Boulder, have been doing some very interesting applied research on automatically extracting “situational awareness” from tweets generated during crises. As is increasingly recognized by many in the humanitarian space, Twitter can at times be an important source of relevant information. The challenge is to make sense of a potentially massive number of crisis tweets in near real-time to turn this information into situational awareness.

Using Natural Language Processing (NLP) and Machine Learning (ML), Colorado colleagues have developed a “suite of classifiers to differentiate tweets across several dimensions: subjectivity, personal or impersonal style, and linguistic register (formal or informal style).” They suggest that tweets contributing to situational awareness are likely to be “written in a style that is objective, impersonal, and formal; therefore, the identification of subjectivity, personal style and formal register could provide useful features for extracting tweets that contain tactical information.” To explore this hypothesis, they studied the following four crisis events: the North American Red River floods of 2009 and 2010, the 2009 Oklahoma grassfires, and the 2010 Haiti earthquake.

The findings of this study were presented at the Association for the Advancement of Artificial Intelligence. The team from Colorado demonstrated that their system, which automatically classifies tweets that contribute to situational awareness, works particularly well when analyzing “low-level linguistic features,” i.e., word frequencies and keyword search. Their analysis also showed that “linguistically-motivated features including subjectivity, personal/impersonal style, and register substantially improve system performance.” In sum, “these results suggest that identifying key features of user behavior can aid in predicting whether an individual tweet will contain tactical information. In demonstrating a link between situational awareness and other markable characteristics of Twitter communication, we not only enrich our classification model, we also enhance our perspective of the space of information disseminated during mass emergency.”
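
To illustrate how low-level and linguistically motivated features can be combined, here is a toy classifier in the spirit of the Colorado approach, though it is emphatically not their system: word unigrams plus crude stand-ins for personal style and informal register, trained on invented examples.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

FIRST_PERSON = {"i", "me", "my", "we", "our"}

def features(tweet):
    """Word unigrams plus crude stand-ins for personal style and
    informal register."""
    tokens = tweet.lower().split()
    feats = {f"word={t}": 1 for t in tokens}
    feats["personal_style"] = int(any(t in FIRST_PERSON for t in tokens))
    feats["informal_register"] = int("!" in tweet or "omg" in tokens)
    return feats

# Invented examples: 1 = contributes to situational awareness, 0 = does not.
tweets = [
    "red river expected to crest at 40 feet near fargo on tuesday",
    "highway 75 closed south of the bridge due to flooding",
    "omg i can't believe this is happening to us!",
    "praying for everyone in haiti tonight",
]
labels = [1, 1, 0, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([features(t) for t in tweets], labels)
print(model.predict([features("bridge on highway 2 closed, water still rising")]))
```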

The paper, entitled: “Natural Language Processing to the Rescue? Extracting ‘Situational Awareness’ Tweets During Mass Emergency,” details the findings above and is available here. The study was authored by Sudha Verma, Sarah Vieweg, William J. Corvey, Leysia Palen, James H. Martin, Martha Palmer, Aaron Schram and Kenneth M. Anderson.

Situational Awareness in Mass Emergency: Behavioral & Linguistic Analysis of Disaster Tweets

Sarah Vieweg’s doctoral dissertation from the University of Colorado is a must-read for anyone interested in the use of Twitter during crises. I read the entire 300-page study because it provides important insights on how automated natural language processing (NLP) can be applied to the Twittersphere to provide situational awareness following a sudden-onset emergency. Big thanks to Sarah for sharing her dissertation with QCRI. I include some excerpts below to highlight the most important findings from her excellent research.

Introduction

“In their research on human behavior in disaster, Fritz and Marks (1954) state: ‘[T]he immediate problem in a disaster situation is neither uncontrolled behavior nor intense emotional reaction, but deficiencies of coordination and organization, complicated by people acting upon individual…definitions of the situation.’”

“Fritz and Marks’ assertion that people define disasters individually, which can lead to problematic outcomes, speaks to the need for common situational awareness among affected populations. Complete information is not attained during mass emergency, else it would not be a mass emergency. However, the more information people have and the better their situational awareness, the better equipped they are to make tactical, strategic decisions.”

“[D]uring crises, people seek information from multiple sources in an attempt to make locally optimal decisions within given time constraints. The first objective, then, is to identify what tweets that contribute to situational awareness ‘look like’—i.e. what specific information do they contain? This leads to the next objective, which is to identify how information is communicated at a linguistic level. This process provides the foundation for tools that can automatically extract pertinent, valuable information—training machines to correctly ‘understand’ human language involves the identification of the words people use to communicate via Twitter when faced with a disaster situation.”

Research Design & Results

Just how much situational awareness can be extracted from Twitter during a crisis? What constitutes situational awareness in the first place vis-a-vis emergency response? And can the answers to these questions yield a dedicated ontology that can be fed into automated natural language processing platforms to generate real-time, shared awareness? To answer these questions, Sarah analyzed four emergency events: the Oklahoma Fires (2009), the Red River Floods (2009 & 2010) and the Haiti Earthquake (2010).

She collected tweets generated during each of these emergencies and developed a three-step qualitative coding process to analyze what kinds of information on Twitter contribute to situational awareness during a major emergency. As a first step, each tweet was categorized as either:

O: Off-topic
“Tweets do not contain any information that mentions or relates to the emergency event.”

R: On-topic and Relevant to Situational Awareness
“Tweets contain information that provides tactical, actionable information that can aid people in making decisions, advise others on how to obtain specific information from various sources, or offer immediate post-impact help to those affected by the mass emergency.”

N: On-topic and Not Relevant to Situational Awareness
“Tweets are on-topic because they mention the emergency by including offers of prayer and support in relation to the emergency, solicitations for donations to charities, or casual reference to the emergency event. But these tweets do not meet the above criteria for situational relevance.”

The O, R, and N coding of the crisis datasets resulted in the following statistics for each of the four datasets:

For the second coding step, on-topic relevant tweets were annotated with more specific information based on the following coding rule:

S: Social Environment
“These tweets include information about how people and/or animals are affected by a hazard, questions asked in relation to the hazard, responses to the hazard and actions to take that directly relate to the hazard and the emergency situation it causes. These tweets all include description of a human element in that they explain or display human behavior.”

B: Built Environment
“Tweets that include information about the effect of the hazard on the built environment, including updates on the state of infrastructure, such as road closures or bridge outages, damage to property, lack of damage to property and the overall state or condition of structures.”

P: Physical Environment
“Tweets that contain specific information about the hazard including particular locations of the hazard agent or where the hazard agent is expected or predicted to travel or predicted states of the hazard agent going forward, notes about past hazards that compare to the current hazard, and how weather may affect hazard conditions. These tweets additionally include information about the type of hazard in general […]. This category also subsumes any general information about the area under threat or in the midst of an emergency […].”

The result of this coding for Haiti is depicted in the figures below.

According to the results, the social environment (‘S’) category is most common in each of the datasets. “Disasters are social events; in each disaster studied in this dissertation, the disaster occurred because a natural hazard impacted a large number of people.”

For the third coding step, Sarah created a comprehensive list of several dozen “Information Types” for each “Environment” using inductive, data-driven analysis of Twitter communications, which she combined with findings from the disaster literature and official government procedures for disaster response. In total, Sarah identified 32 specific types of information that contribute to situational awareness. The table below compares the Twitter Information Types for all three environments as related to government procedures, for example.

“Based on the discourse analysis of Twitter communications broadcast during four mass emergency events,” Sarah identified 32 specific types of information that “contribute to situational awareness. Subsequent analysis of the sociology of disaster literature, government documents and additional research on the use of Twitter in mass emergency uncovered three additional types of information.”

In sum, “[t]he comparison of the information types [she] uncovered in [her] analysis of Twitter communications to sociological research on disaster situations, and to governmental procedures, serves as a way to gauge the validity of [her] ground-up, inductive analysis.” Indeed, this enabled Sarah to identify areas of overlap as well as gaps that needed to be filled. The final Information Type framework is listed below:

And here are the results of this coding framework when applied to the Haiti data:

“Across all four datasets, the top three types of information Twitter users communicated comprise between 36.7-52.8% of the entire dataset. This is an indication that though Twitter users communicate about a variety of information, a large portion of their attention is focused on only a few types of information, which differ across each emergency event. The maximum number of information types communicated during an event is twenty-nine, which was during the Haiti earthquake.”

Natural Language Processing & Findings

The coding described above was all done manually by Sarah and research colleagues. But could the ontology she has developed (Information Types) be used to automatically identify tweets that are both on-topic and relevant for situational awareness? To find out, she carried out a study using VerbNet.

“The goal of identifying verbs used in tweets that convey information relevant to situational awareness is to provide a resource that demonstrates which VerbNet classes indicate information relevant to situational awareness. The VerbNet class information can serve as a linguistic feature that provides a classifier with information to identify tweets that contain situational awareness information. VerbNet classes are useful because the classes provide a list of verbs that may not be present in any of the Twitter data I examined, but which may be used to describe similar information in unseen data. In other words, if a particular VerbNet class is relevant to situational awareness, and a classifier identifies a verb in that class that is used in a previously unseen tweet, then that tweet is more likely to be identified as containing situational awareness information.”
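
Here is a minimal sketch of how VerbNet class membership could be turned into a feature, using the VerbNet corpus that ships with NLTK. The set of “relevant” class prefixes below is an invented placeholder rather than the dissertation’s actual verb-to-class mapping.

```python
# One-time setup: import nltk; nltk.download("verbnet"); nltk.download("wordnet")
from nltk.corpus import verbnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Invented placeholder prefixes; a real feature would use the VerbNet
# classes actually mapped from the study's list of relevant verbs.
RELEVANT_CLASS_PREFIXES = ("escape-51.1", "destroy-44", "send-11.1")

def has_relevant_verb(tweet):
    """True if any token, lemmatized as a verb, belongs to a VerbNet
    class deemed relevant to situational awareness."""
    for token in tweet.lower().split():
        lemma = lemmatizer.lemmatize(token.strip(".,!?"), pos="v")
        for class_id in verbnet.classids(lemma=lemma):
            if class_id.startswith(RELEVANT_CLASS_PREFIXES):
                return True
    return False

print(has_relevant_verb("Thousands flee the coast as the water rises"))
```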

Sarah identified 195 verbs that mapped to her Information Types described earlier. The results of using this verb-based ontology are mixed, however. “A majority of tweets do not contain one of the verbs in the identified VerbNet classes, which indicates that additional features are necessary to classify tweets according to the social, built or physical environment.”

However, when applying the 195 verbs to identify on-topic tweets relevant to situational awareness to previously unused Haiti data, Sarah found that using her customized VerbNet ontology resulted in finding 9% more tweets than when using her “Information Types” ontology. In sum, the results show that “using VerbNet classes as a feature is encouraging, but other features are needed to identify tweets that contain situational awareness information, as not all tweets that contain situational awareness information use one of the verb members in the […] identified VerbNet classes. In addition, more research in this area will involve using the semantic and syntactic information contained in each VerbNet class to identify event participants, which can lead to more fine-grained categorization of tweets.”

Conclusion

“Many tweets that communicate situational awareness information do not contain one of the verbs in the identified VerbNet classes, [but] the information provided with named entities and semantic roles can serve as features that classifiers can use to identify situational awareness information in the absence of such a verb. In addition, for tweets correctly identified as containing information relevant to situational awareness, named entities and semantic roles can provide classifiers with additional information to classify these tweets into the social, built and physical environment categories, and into specific information type categories.”

“Finding the best approach toward the automatic identification of situational awareness information communicated in tweets is a task that will involve further training and testing of classifiers.”

Crowdsourcing for Human Rights Monitoring: Challenges and Opportunities for Information Collection & Verification

This new book, Human Rights and Information Communication Technologies: Trends and Consequences of Use, promises to be a valuable resource to both practitioners and academics interested in leveraging new information & communication technologies (ICTs) in the context of human rights work. I had the distinct pleasure of co-authoring a chapter for this book with my good colleague and friend Jessica Heinzelman. We focused specifically on the use of crowdsourcing and ICTs for information collection and verification. Below is the Abstract & Introduction for our chapter.

Abstract

Accurate information is a foundational element of human rights work. Collecting and presenting factual evidence of violations is critical to the success of advocacy activities and the reputation of organizations reporting on abuses. To ensure credibility, human rights monitoring has historically been conducted through highly controlled organizational structures that face mounting challenges in terms of capacity, cost and access. The proliferation of Information and Communication Technologies (ICTs) provides new opportunities to overcome some of these challenges through crowdsourcing. At the same time, however, crowdsourcing raises new challenges of verification and information overload that have made human rights professionals skeptical of its utility. This chapter explores whether the efficiencies gained through an open call for monitoring and reporting abuses provide a net gain for human rights monitoring and analyzes the opportunities and challenges that new and traditional methods pose for verifying crowdsourced human rights reporting.

Introduction

Accurate information is a foundational element of human rights work. Collecting and presenting factual evidence of violations is critical to the success of advocacy activities and the reputation of organizations reporting on abuses. To ensure credibility, human rights monitoring has historically been conducted through highly controlled organizational structures that face mounting challenges in terms of capacity, cost and access.

The proliferation of Information and Communication Technologies (ICTs) may provide new opportunities to overcome some of these challenges. For example, ICTs make it easier to engage large networks of unofficial volunteer monitors to crowdsource the monitoring of human rights abuses. Jeff Howe coined the term “crowdsourcing” in 2006, defining it as “the act of taking a job traditionally performed by a designated agent and outsourcing it to an undefined, generally large group of people in the form of an open call” (Howe, 2009). Applying this concept to human rights monitoring, Molly Land (2009) asserts that, “given the limited resources available to fund human rights advocacy…amateur involvement in human rights activities has the potential to have a significant impact on the field” (p. 2). That said, she warns that professionalization in human rights monitoring “has arisen not because of an inherent desire to control the process, but rather as a practical response to the demands of reporting – namely, the need to ensure the accuracy of the information contained in the report” (Land, 2009, p. 3).

Because “accuracy is the human rights monitor’s ultimate weapon” and the advocate’s “ability to influence governments and public opinion is based on the accuracy of their information,” the risk of inaccurate information may trump any advantages gained through crowdsourcing (Codesria & Amnesty International, 2000, p. 32). To this end, the question facing human rights organizations that wish to leverage the power of the crowd is “whether [crowdsourced reports] can accomplish the same [accurate] result without a centralized hierarchy” (Land, 2009). The answer to this question depends on whether reliable verification techniques exist so organizations can use crowdsourced information in a way that does not jeopardize their credibility or compromise established standards. While many human rights practitioners (and indeed humanitarians) still seem to be allergic to the term crowdsourcing, further investigation reveals that established human rights organizations already use crowdsourcing and verification techniques to validate crowdsourced information and that there is great potential in the field for new methods of information collection and verification.

This chapter analyzes the opportunities and challenges that new and traditional methods pose for verifying crowdsourced human rights reporting. The first section reviews current methods for verification in human rights monitoring. The second section outlines existing methods used to collect and validate crowdsourced human rights information. Section three explores the practical opportunities that crowdsourcing offers relative to traditional methods. The fourth section outlines critiques and solutions for crowdsourcing reliable information. The final section proposes areas for future research.

The book is available for purchase here. Warning: you won’t like the price but at least they’re taking an iTunes approach, allowing readers to purchase single chapters if they prefer. Either way, Jess and I were not paid for our contribution.

For more information on how to verify crowdsourced information, please visit the following links:

  • Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media (Link)
  • How to Verify and Counter Rumors in Social Media (Link)
  • Social Media and Life Cycle of Rumors during Crises (Link)
  • Truthiness as Probability: Moving Beyond the True or False Dichotomy when Verifying Social Media (Link)
  • Crowdsourcing Versus Putin (Link)

PeopleBrowsr: Next-Generation Social Media Analysis for Humanitarian Response?

As noted in this blog post on “Data Philanthropy for Humanitarian Response,” members of the Digital Humanitarian Network (DHNetwork) are still using manual methods for media monitoring. When the United Nations Office for the Coordination of Humanitarian Affairs (OCHA) activated the Standby Volunteer Task Force (SBTF) to crisis map Libya last year, for example, SBTF volunteers manually monitored hundreds of Twitter handles and news sites for several weeks.

SBTF volunteers (Mapsters) do not have access to a smart microtasking platform that could have distributed the task in more efficient ways. Nor do they have access to even semi-automated tools for content monitoring and information retrieval. Instead, they used a Google Spreadsheet to list the sources they were manually monitoring and turned this spreadsheet into a sign-up sheet where each Mapster could sign on for 3-hour shifts every day. The SBTF is basically doing “crowd computing” using the equivalent of a typewriter.

Meanwhile, companies like Crimson Hexagon, NetBase, RecordedFuture and several others have each developed sophisticated ways to monitor social and/or mainstream media for various private sector applications such as monitoring brand perception. So my colleague Nazila kindly introduced me to her colleagues at PeopleBrowsr after reading my post on Data Philanthropy. Last week, Marc from PeopleBrowsr gave me a thorough tour of the platform. I was definitely impressed and am excited that Marc wants us to pilot the platform in support of the Digital Humanitarian Network. So what’s the big deal about PeopleBrowsr? To begin with, the platform has access to 1,000 days of social media data and over 3 terabytes of social data per month.

To put this in terms of information velocity, PeopleBrowsr receives 10,000 social media posts per second from a variety of sources including Twitter, Facebook, fora and blogs. On the latter, they monitor posts from over 40 million blogs including all of Tumblr, Posterous, Blogspot and every WordPress-hosted site. They also pull in content from YouTube and Flickr. (Click on the screenshots below to magnify them.)

Let’s search for the term “tsunami” on Twitter. (One could enter a complex query, e.g., and/or, not, etc., and also search using Twitter handles, word or hashtag clouds, top URLs, videos, pictures, etc.) PeopleBrowsr summarizes the results by Location and Community. Location simply refers to where those generating content referring to a tsunami are located. Of course, many Twitter users may tweet about an event without actually being eye-witnesses (think of Diaspora groups, for example). While PeopleBrowsr doesn’t geo-tag the location of reported events, you can very easily and quickly identify which Twitter users are tweeting the most about a given event and where they are located.

As for Community, PeopleBrowsr has indexed millions of social media users and clustered them into different communities based on their profile/bio information. Given our interest in humanitarian response, we could create our own community of social media users from the humanitarian sector and limit our search to those users only. Communities can also be created based on hashtags. The result of the “tsunami” search is displayed below.

This result can be filtered further by gender, sentiment, number of Twitter followers, urgent words (e.g., alert, help, asap), time period and location, for example. The platform can monitor and display posts in any language. In addition, PeopleBrowsr has its very own Kred score, which quantifies the “credibility” of social media users. The scoring metric behind Kred is completely transparent and also community driven. “Kred is a transparent way to measure influence and outreach in social media. Kred generates unique scores for every domain of expertise. Regardless of follower count, a person is influential if their community is actively listening and engaging with their content.”

Using Kred, PeopleBrowsr can do influence analysis on Twitter across all languages. They’ve also added Facebook to Kred, but only as an opt-in option. PeopleBrowsr also has some great built-in and interactive data analytics tools. In addition, one can download a situation report in PDF and print that off if there’s a need to go offline.

What appeals to me the most is perhaps the full “drill-down” functionality of PeopleBrowsr’s data analytics tools. For example, I can drill down to the number of tweets per month that reference the word “tsunami” and drill down further per week and per day.

Moreover, I can sort through the individual tweets themselves based on specific filters and even access the underlying tweets complete with twitter handles, time-stamps, Kred scores, etc.

This latter feature would make it possible for the SBTF to copy & paste and map individual tweets on a live crisis map. In fact, the underlying data can be downloaded into a CSV file and added to a Google Spreadsheet for Mapsters to curate. Hopefully the Ushahidi team will also provide an option to upload CSVs to SwiftRiver so users can curate/filter pre-existing datasets as well as content generated live. What if you don’t have time to get on PeopleBrowsr and filter, download, etc? As part of their customer support, PeopleBrowsr will simply provide the data to you directly.
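
Here is a small sketch of that copy-and-paste step done programmatically: reading a hypothetical PeopleBrowsr CSV export and reshaping it into rows a crisis map or curation spreadsheet could ingest. The column names are my assumptions about the export format, not documented fields.

```python
import csv

def csv_to_map_rows(path):
    """Reshape an exported CSV into rows for a crisis map or curation
    spreadsheet. Column names are assumptions about the export format."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            rows.append({
                "title": record.get("text", "")[:80],
                "description": record.get("text", ""),
                "author": record.get("handle", ""),
                "date": record.get("timestamp", ""),
                "location": record.get("location", ""),
                "credibility_hint": record.get("kred_score", ""),
            })
    return rows

# Hypothetical file name, for illustration only.
for row in csv_to_map_rows("peoplebrowsr_export.csv"):
    print(row["date"], row["author"], row["title"])
```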

So what’s next? Marc and I are taking the following steps:

  • Schedule an online demo of PeopleBrowsr for the SBTF Core Team (they are for now the only members of the Digital Humanitarian Network with a dedicated and experienced Media Monitoring Team);
  • SBTF pilots PeopleBrowsr for preparedness purposes;
  • SBTF deploys PeopleBrowsr during 2-3 official activations of the Digital Humanitarian Network;
  • SBTF analyzes the added value of PeopleBrowsr for humanitarian response and provides expert feedback to PeopleBrowsr on how to improve the tool for humanitarian response.

Surprising Findings: Using Mobile Phones to Predict Population Displacement After Major Disasters

Rising concerns over the consequences of mass refugee flows during several crises in the late 1970s are what prompted the United Nations (UN) to call for the establishment of early warning systems for the first time. “In 1978-79 for example, the United Nations and UNHCR were clearly overwhelmed by and unprepared for the mass influx of Indochinese refugees in South East Asia. The number of boat people washed onto the beaches there seriously challenged UNHCR’s capability to cope. One of the issues was the lack of advance information. The result was much human suffering, including many deaths. It took too long for emergency assistance by intergovernmental and non-governmental organizations to reach the sites” (Druke 2012 PDF).

More than three decades later, my colleagues at Flowminder are using location data from mobile phones to nowcast and predict population displacement after major disasters. Focusing on the devastating 2010 Haiti earthquake, the team analyzed the movement of 1.9 million mobile users before and after the earthquake. Naturally, the Flowminder team expected that the mass exodus from Port-au-Prince would be rather challenging to predict. Surprisingly, however, the predictability of people’s movements remained high and even increased during the three-month period following the earthquake.

The team just released their findings in a peer-reviewed study entitled: “Predictability of population displacement after the 2010 Haiti earthquake” (PNAS 2012). As the analysis reveals, “the destinations of people who left the capital during the first three weeks after the earthquake was highly correlated with their mobility patterns during normal times, and specifically with the locations in which people had significant social bonds, as measured by where they spent Christmas and New Year holidays” (PNAS 2012).

“For the people who left Port-au-Prince, the duration of their stay outside the city, as well as the time for their return, all followed a skewed, fat-tailed distribution. The findings suggest that population movements during disasters may be significantly more predictable than previously thought” (PNAS 2012). Intriguingly, the analysis also revealed that the period of time that people in Port-au-Prince waited to leave the city (and then return) was “power-law distributed, both during normal days and after the earthquake, albeit with different exponents” (PNAS 2012). Clearly then, “[p]eople’s movements are highly influenced by their historic behavior and their social bonds, and this fact remained even after one of the most severe disasters in history” (PNAS 2012).
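
For readers curious about what “power-law distributed” waiting times imply in practice, here is a small sketch of the standard maximum-likelihood estimate of the exponent, applied to synthetic waiting times. It illustrates the statistical idea only and is not the Flowminder analysis itself.

```python
import numpy as np

def power_law_alpha(samples, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law
    p(x) ~ x**(-alpha), restricted to x >= x_min."""
    x = np.asarray([s for s in samples if s >= x_min], dtype=float)
    return 1.0 + len(x) / np.sum(np.log(x / x_min))

# Synthetic waiting times (in days) drawn from a power law with alpha ~ 2.5.
rng = np.random.default_rng(42)
waits = 1.0 * (1.0 - rng.random(5000)) ** (-1.0 / 1.5)
print(round(power_law_alpha(waits, x_min=1.0), 2))  # should recover roughly 2.5
```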


I wonder how this approach could be used in combination with crowdsourced satellite imagery analysis on the one hand and with Agent Based Models on the other. In terms of crowdsourcing, I have in mind the work carried out by the Standby Volunteer Task Force (SBTF) in partnership with UNHCR and Tomnod in Somalia last year. SBTF volunteers (“Mapsters”) tagged over a quarter million features that looked like IDP shelters in under 120 hours, yielding a triangulated count of approximately 47,500 shelters.

In terms of Agent Based Models (ABMs), some colleagues and I worked on “simulating population displacements following a crisis” back in 2006 while at the Santa Fe Institute (SFI). We decided to use an Agent Based Model because the data on population movement was simply not within our reach. Moreover, we were particularly interested in modeling movements of ethnic populations after a political crisis and thus within the context of a politically charged environment.

So we included a preference for “safety in numbers” within the model. This parameter can easily be tweaked to reflect a preference for moving to locations that allow for the maintenance of social bonds as identified in the Flowminder study. The figure above lists all the parameters we used in our simple decision theoretic model.

The output below depicts the Agent Based Model in action. The multi-colored panels on the left depict the geographical location of ethnic groups at a certain period of time after the crisis escalates. The red panels on the right depict the underlying social networks and bonds that correspond to the geographic distribution just described. The main variable we played with was the size or magnitude of the sudden onset crisis to determine whether and how people might move differently around various ethnic enclaves. The study along with the results is available in this PDF.
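
For the curious, here is a stripped-down sketch of the kind of decision rule such a model can use: each displaced agent scores candidate destinations by a weighted mix of “safety in numbers” and social bonds, then moves to the best-scoring one. The locations, weights and population are illustrative and do not reproduce the original SFI model.

```python
import random
from collections import Counter

random.seed(1)

LOCATIONS = ["north", "south", "east", "west"]
W_SAFETY, W_BONDS = 0.7, 0.3  # illustrative weights, easily tweaked

class Agent:
    def __init__(self, group, location, bonds):
        self.group = group        # ethnic/social group label
        self.location = location  # current location
        self.bonds = bonds        # locations where the agent has social ties

def choose_destination(agent, agents):
    """Score each location by co-group presence ('safety in numbers')
    plus social bonds, and return the highest-scoring one."""
    group_counts = Counter(a.location for a in agents if a.group == agent.group)
    total = sum(group_counts.values()) or 1

    def score(location):
        safety = group_counts[location] / total
        bond = 1.0 if location in agent.bonds else 0.0
        return W_SAFETY * safety + W_BONDS * bond

    return max(LOCATIONS, key=score)

# Tiny illustrative population: two groups with different social ties.
agents = [Agent("A", random.choice(LOCATIONS), {"north"}) for _ in range(30)]
agents += [Agent("B", random.choice(LOCATIONS), {"south"}) for _ in range(30)]

for step in range(3):  # crude synchronous update over a few time steps
    destinations = [choose_destination(a, agents) for a in agents]
    for agent, destination in zip(agents, destinations):
        agent.location = destination
    print(step, Counter((a.group, a.location) for a in agents))
```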

In sum, it would be interesting to carry out Flowminder’s analysis in combination with crowdsourced satellite imagery analysis and live sensor data feeding into an Agent Based Model. Dissertation, anyone?

Muḥammad ibn Mūsā al-Khwārizmī: An Update from the Qatar Computing Research Institute

I first heard of al-Khwārizmī in my ninth-grade computer science class at the International School of Vienna (AIS) back in 1993. Dr. Herman Prossinger, who taught the course, is exactly the kind of person one describes when answering the question: which teacher had the most impact on you while growing up? I wonder how many other 9th graders in the world had the good fortune of being taught computer science by a full-fledged professor with a PhD dissertation entitled “Isothermal Gas Spheres in General Relativity Theory” (1976) and numerous peer-reviewed publications in top-tier scientific journals including Nature?

Muḥammad ibn Mūsā al-Khwārizmī was a brilliant mathematician & astronomer who spent his time as a scholar in the House of Wisdom in Baghdad (possibly the best name of any co-working space in history). “Al-Khwarithmi” was initially transliterated into Latin as Algoritmi. The manuscript above, for example, begins with “DIXIT algorizmi,” meaning “Says al-Khwārizmī.” And thus was born the word Algorithm. But al-Khwārizmī’s fundamental contributions were not limited to the fields of mathematics and astronomy; he is also praised for his important work on geography and cartography. Published in 833, his Kitāb ṣūrat al-Arḍ (Arabic: كتاب صورة الأرض) or “Book on the Appearance of the Earth” was a revised and corrected version of Ptolemy’s Geography. al-Khwārizmī’s book comprised an impressive list of 2,402 coordinates of cities and other geographical features. The only surviving copy of the book can be found at Strasbourg University. I’m surprised the item has not yet been purchased by Qatar and relocated to Doha.

View of the bay from QCRI in Doha, Qatar.

This brings me to the Qatar (Foundation) Computing Research Institute (QCRI), which was almost called the al-Khwārizmī Computing Research Institute. I joined QCRI exactly two weeks ago as Director of Social Innovation. My first impression? QCRI is Doha’s “House of Whizzkids”. The team is young, dynamic, international and super smart. I’m already working on several exploratory research and development (R&D) projects that could potentially lead to initial prototypes by the end of the year. These have to do with the application of social computing and big data analysis for humanitarian response. So I’ve been in touch with several colleagues at the United Nations (UN) Office for the Coordination of Humanitarian Affairs (OCHA) to bounce these early ideas off of them, and I’m thrilled that all responses thus far have been very positive.

My QCRI colleagues and I are also looking into collaborative platforms for “smart microtasking” which may be useful for the Digital Humanitarian Network. In addition, we’re just starting to explore potential solutions for quantifying veracity in social media, a rather non-trivial problem as Dr. Prossinger would often say with a sly smile in relation to NP-hard problems. In terms of partnership building, I will be in New York, DC and Boston next month for official meetings with the UN, World Bank and MIT to explore possible collaborations on specific projects. The team in Doha is particularly strong on big data analytics, social computing, data cleaning, machine learning and translation. In fact, most of the whizzkids here come from very impressive track records with Microsoft, Yahoo, Ivy Leagues, etc. So I’m excited by the potential.

View of Tornado Tower (purple lights) where QCRI is located.

The reason I’m not going into specifics vis-a-vis these early R&D efforts is not because I want to be secretive or elusive. Not at all. We’re still refining the ideas ourselves and simply want to manage expectations. There is a very strong and genuine interest within QCRI to contribute meaningfully to the humanitarian technology space. But we’re really just getting started, still hiring left, center and right, and we’ll be in R&D mode for a while. Plus, we don’t want to rush just for the sake of launching a new product. All too often, humanitarian technologies are developed without the benefit (and luxury) of advanced R&D. But if QCRI is going to help shape next-generation humanitarian technology solutions, we should do this in a way that is deliberate, cutting-edge and strategic. That is our comparative advantage.

In sum, the outcome of our R&D efforts may not always lead to a full-fledged prototype, but all the research and findings we carry out will definitely be shared publicly so we can move the field forward. We’re also committed to developing free and open source software as part of our prototyping efforts. Finally, we have no interest in re-inventing the wheel and far prefer working in partnerships rather than in isolation. So there we go, time to R&D like al-Khwārizmī.

Big Data Philanthropy for Humanitarian Response

My colleague Robert Kirkpatrick from Global Pulse has been actively promoting the concept of “data philanthropy” within the context of development. Data philanthropy involves companies sharing proprietary datasets for social good. I believe we urgently need big (social) data philanthropy for humanitarian response as well. Disaster-affected communities are increasingly the source of big data, which they generate and share via social media platforms like Twitter. Processing this data manually, however, is very time consuming and resource intensive. Indeed, large numbers of digital humanitarian volunteers are often needed to monitor and process user-generated content from disaster-affected communities in near real-time.

Meanwhile, companies like Crimson Hexagon, Geofeedia, NetBase, Netvibes, RecordedFuture and Social Flow are defining the cutting edge of automated methods for media monitoring and analysis. So why not set up a Big Data Philanthropy group for humanitarian response in partnership with the Digital Humanitarian Network? Call it Corporate Social Responsibility (CSR) for digital humanitarian response. These companies would benefit from the publicity of supporting such positive and highly visible efforts. They would also receive expert feedback on their tools.

This “Emergency Access Initiative” could be modeled along the lines of the International Charter, whereby certain criteria vis-a-vis the disaster would need to be met before an activation request could be made to the Big Data Philanthropy group for humanitarian response. These companies would then provide a dedicated account to the Digital Humanitarian Network (DHNet). These accounts would be available for 72 hours only and would also be monitored by said companies to ensure they aren’t being abused. We would simply need to have relevant members of the DHNet trained on these platforms and draft the appropriate protocols, data privacy measures and MoUs.

I’ve had preliminary conversations with humanitarian colleagues from the United Nations and DHNet who confirm that “this type of collaboration would be seen very positively from the coordination area within the traditional humanitarian sector.” On the business development end, this setup would enable companies to get their foot in the door of the humanitarian sector—a multi-billion dollar industry. Members of the DHNet are early adopters of humanitarian technology and are ideally placed to demonstrate the added value of these platforms since they regularly partner with large humanitarian organizations. Indeed, DHNet operates as a partnership model. This would enable humanitarian professionals to learn about new Big Data tools, see them in action and, possibly, purchase full licenses for their organizations. In sum, data philanthropy is good for business.

I have colleagues at most of the companies listed above and thus plan to actively pursue this idea further. In the meantime, I’d be very grateful for any feedback and suggestions, particularly on the suggested protocols and MoUs. So I’ve set up this open and editable Google Doc for feedback.

Big thanks to the team at the Disaster Information Management Research Center (DIMRC) for planting the seeds of this idea during our recent meeting. Check out their very neat Emergency Access Initiative.