Tag Archives: analysis

Using E-Mail Data to Estimate International Migration Rates

As is well known, “estimates of demographic flows are inexistent, outdated, or largely inconsistent, for most countries.” I would add costly to that list as well. So my QCRI colleague Ingmar Weber co-authored a very interesting study on the use of e-mail data to estimate international migration rates.

The study analyzes a large sample of Yahoo! emails sent by 43 million users between September 2009 and June 2011. “For each message, we know the date when it was sent and the geographic location from where it was sent. In addition, we could link the message with the person who sent it, and with the user’s demographic information (date of birth and gender), that was self reported when he or she signed up for a Yahoo! account. We estimated the geographic location from where each email message was sent using the IP address of the user.”

The authors used data on existing migration rates for a dozen countries and international statistics on Internet diffusion rates by age and gender in order to correct for selection bias. For example, “estimated number of migrants, by age group and gender, is multiplied by a correction factor to adjust for over-representation of more educated and mobile people in groups for which the Internet penetration is low.” The graphs below are estimates of age and gender-specific immigration rates for the Philippines. “The gray area represents the size of the bias correction.” This means that “without any correction for bias, the point estimates would be at the upper end of the gray area.” These methods “correct for the fact that the group of users in the sample, although very large, is not representative of the entire population.”
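The correction step described above boils down to reweighting raw counts by group-level Internet penetration. Here is a minimal sketch of the idea; the age groups, penetration rates, baseline, and the correction-factor formula are all hypothetical placeholders, not the study's actual values or calibrated model:

```python
# Hypothetical sketch of a selection-bias correction for migration estimates.
# Raw migrant counts observed among e-mail users are scaled up where Internet
# penetration is low, since users in those groups are unrepresentative.

# Illustrative inputs: (age_group, gender) -> raw migrant count from the sample
raw_migrants = {("15-24", "F"): 1200, ("15-24", "M"): 1500,
                ("25-44", "F"): 900,  ("25-44", "M"): 1100}

# Illustrative Internet penetration rates for the same groups (share of population)
penetration = {("15-24", "F"): 0.35, ("15-24", "M"): 0.40,
               ("25-44", "F"): 0.20, ("25-44", "M"): 0.25}

def corrected_estimate(group, baseline=0.50):
    """Scale the raw count by a correction factor that grows as penetration
    falls below a baseline level (a stand-in for the paper's calibrated
    correction factors)."""
    factor = max(1.0, baseline / penetration[group])
    return raw_migrants[group] * factor

for group in sorted(raw_migrants):
    print(group, round(corrected_estimate(group)))
```

The low-penetration 25-44 groups get scaled up the most, which mirrors the intuition in the quote: where few people are online, the online few are least representative.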

The results? Ingmar and his co-author Emilio Zagheni were able to “estimate migration rates that are consistent with the ones published by those few countries that compile migration statistics. By using the same method for all geographic regions, we obtained country statistics in a consistent way, and we generated new information for those countries that do not have registration systems in place (e.g., developing countries), or that do not collect data on out-migration (e.g., the United States).” Overall, the study documented a “global trend of increasing mobility,” which is “growing at a faster pace for females than males. The rate of increase for different age groups varies across countries.”

The authors argue that this approach could also be used in the context of “natural” disasters and man-made disasters. In terms of future research, they are interested in evaluating “whether sending a high proportion of e-mail messages to a particular country (which is a proxy for having a strong social network in the country) is related to the decision of actually moving to the country.” Naturally, they are also interested in analyzing Twitter data. “In addition to mobility or migration rates, we could evaluate sentiments pro or against migration for different geographic areas. This would help us understand how sentiments change near an international border or in regions with different migration rates and economic conditions.”

I’m very excited to have Ingmar at QCRI so we can explore these ideas further in the context of humanitarian and development challenges. I’ve been discussing similar research ideas with my colleagues at UN Global Pulse and there may be a real sweet spot for collaboration here, particularly with the recently launched Pulse Lab in Jakarta. The possibility of collaborating with my colleagues at Flowminder could also be really interesting given their important study of population movement following the Haiti Earthquake. In conclusion, I fully share the authors’ sentiment when they highlight the fact that it is “more and more important to develop models for data sharing between private companies and the academic world, that allow for both protection of users’ privacy & private companies’ interests, as well as reproducibility in scientific publishing.”

MAQSA: Social Analytics of User Responses to News

Designed by QCRI in partnership with MIT and Al-Jazeera, MAQSA provides an interactive topic-centric dashboard that summarizes news articles and user responses (comments, tweets, etc.) to these news items. The platform thus helps editors and publishers in newsrooms like Al-Jazeera’s better “understand user engagement and audience sentiment evolution on various topics of interest.” In addition, MAQSA “helps news consumers explore public reaction on articles relevant to a topic and refine their exploration via related entities, topics, articles and tweets.” The pilot platform currently uses Al-Jazeera data such as Op-Eds from Al-Jazeera English.

Given a topic such as “The Arab Spring,” or “Oil Spill”, the platform combines time, geography and topic to “generate a detailed activity dashboard around relevant articles. The dashboard contains an annotated comment timeline and a social graph of comments. It utilizes commenters’ locations to build maps of comment sentiment and topics by region of the world. Finally, to facilitate exploration, MAQSA provides listings of related entities, articles, and tweets. It algorithmically processes large collections of articles and tweets, and enables the dynamic specification of topics and dates for exploration.”

While others have tried to develop similar dashboards in the past, these have “not taken a topic-centric approach to viewing a collection of news articles with a focus on their user comments in the way we propose.” The team at QCRI has since added a number of exciting new features for Al-Jazeera to try out as widgets on their site. I’ll be sure to blog about these and other updates when they are officially launched. Note that other media companies (e.g., UK Guardian) will also be able to use this platform and widgets once they become public.

As always with such new initiatives, my very first thought and question is: how might we apply them in a humanitarian context? For example, perhaps MAQSA could be repurposed to do social analytics of responses from local stakeholders with respect to humanitarian news articles produced by IRIN, an award-winning humanitarian news and analysis service covering the parts of the world often under-reported, misunderstood or ignored. Perhaps an SMS component could also be added to a MAQSA-IRIN platform to facilitate this. Or perhaps there’s an application for the work that Internews carries out with local journalists and consumers of information around the world. What do you think?

The Best Way to Crowdsource Satellite Imagery Analysis for Disaster Response

My colleague Kirk Morris recently pointed me to this very neat study on iterative versus parallel models of crowdsourcing for the analysis of satellite imagery. The study was carried out by French researcher and engineer Nicolas Maisonneuve for the upcoming GIScience 2012 conference.

Nicolas finds that after reaching a certain threshold, adding more volunteers to the parallel model does “not change the representativeness of opinion and thus will not change the consensual output.” His analysis also shows that the value of this threshold has a significant impact on the resulting quality of the parallel work and thus should be chosen carefully. In terms of the iterative approach, Nicolas finds that “the first iterations have a high impact on the final results due to a path dependency effect.” To this end, “stronger commitment during the first steps are thus a primary concern for using such model,” which means that “asking expert/committed users to start” is important.

Nicolas’s study also reveals that the parallel approach is better able to correct wrong annotations (wrong analysis of the satellite imagery) than the iterative model for images that are fairly straightforward to interpret. In contrast, the iterative model is better suited for handling more ambiguous imagery. But there is a catch: the potential path dependency effect in the iterative model means that “mistakes could be propagated, generating more easily type I errors as the iterations proceed.” In terms of spatial coverage, the iterative model is more efficient since the parallel model leverages redundancy to ensure data quality. Still, Nicolas concludes that the “parallel model provides an output which is more reliable than that of a basic iterative [because] the latter is sensitive to vandalism or knowledge destruction.”

So the question that naturally follows is this: how can parallel and iterative methodologies be combined to produce a better overall result? Perhaps the parallel approach could be used as the default to begin with. However, images that are considered difficult to interpret would get pushed from the parallel workflow to the iterative workflow. The latter would first be processed by experts in order to create favorable path dependency. Could this hybrid approach be the winning strategy?
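As a thought experiment, the hybrid workflow can be sketched in a few lines of code. Everything here is illustrative (the tile names, labels, and 75% agreement threshold are invented, not from Maisonneuve's study): images are first annotated in parallel, and those without a clear majority are escalated to an expert-seeded iterative queue.

```python
from collections import Counter

def parallel_consensus(annotations, agreement=0.75):
    """Return (label, is_confident) from independent parallel annotations.
    The label is the most common annotation; it is 'confident' only if
    it wins at least the `agreement` share of votes."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations) >= agreement

def route(image_annotations):
    """Split images into parallel-consensus results and an escalation queue
    destined for the expert-seeded iterative workflow."""
    resolved, escalate = {}, []
    for image, annotations in image_annotations.items():
        label, confident = parallel_consensus(annotations)
        if confident:
            resolved[image] = label
        else:
            escalate.append(image)  # ambiguous: hand off to iteration
    return resolved, escalate

resolved, escalate = route({
    "tile_01": ["shelter", "shelter", "shelter", "shelter"],  # clear-cut
    "tile_02": ["shelter", "road", "field", "shelter"],       # ambiguous
})
print(resolved, escalate)
```

The design choice mirrors the study's findings: redundancy (parallel voting) catches errors on easy images cheaply, while the genuinely ambiguous cases, where iteration outperforms, are the only ones that consume expert attention.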

Big Data Philanthropy for Humanitarian Response

My colleague Robert Kirkpatrick from Global Pulse has been actively promoting the concept of “data philanthropy” within the context of development. Data philanthropy involves companies sharing proprietary datasets for social good. I believe we urgently need big (social) data philanthropy for humanitarian response as well. Disaster-affected communities are increasingly the source of big data, which they generate and share via social media platforms like Twitter. Processing this data manually, however, is very time consuming and resource intensive. Indeed, large numbers of digital humanitarian volunteers are often needed to monitor and process user-generated content from disaster-affected communities in near real-time.

Meanwhile, companies like Crimson Hexagon, Geofeedia, NetBase, Netvibes, RecordedFuture and Social Flow are defining the cutting edge of automated methods for media monitoring and analysis. So why not set up a Big Data Philanthropy group for humanitarian response in partnership with the Digital Humanitarian Network? Call it Corporate Social Responsibility (CSR) for digital humanitarian response. These companies would benefit from the publicity of supporting such positive and highly visible efforts. They would also receive expert feedback on their tools.

This “Emergency Access Initiative” could be modeled along the lines of the International Charter whereby certain criteria vis-a-vis the disaster would need to be met before an activation request could be made to the Big Data Philanthropy group for humanitarian response. These companies would then provide a dedicated account to the Digital Humanitarian Network (DHNet). These accounts would be available for 72 hours only and also be monitored by said companies to ensure they aren’t being abused. We would simply need to have relevant members of the DHNet trained on these platforms and draft the appropriate protocols, data privacy measures and MoUs.

I’ve had preliminary conversations with humanitarian colleagues from the United Nations and DHNet who confirm that “this type of collaboration would be seen very positively from the coordination area within the traditional humanitarian sector.” On the business development end, this setup would enable companies to get their foot in the door of the humanitarian sector, a multi-billion dollar industry. Members of the DHNet are early adopters of humanitarian technology and are ideally placed to demonstrate the added value of these platforms since they regularly partner with large humanitarian organizations. Indeed, DHNet operates as a partnership model. This would enable humanitarian professionals to learn about new Big Data tools, see them in action and, possibly, purchase full licenses for their organizations. In sum, data philanthropy is good for business.

I have colleagues at most of the companies listed above and thus plan to actively pursue this idea further. In the meantime, I’d be very grateful for any feedback and suggestions, particularly on the suggested protocols and MoUs. So I’ve set up this open and editable Google Doc for feedback.

Big thanks to the team at the Disaster Information Management Research Center (DIMRC) for planting the seeds of this idea during our recent meeting. Check out their very neat Emergency Access Initiative.

Using Rayesna to Track the 2012 Egyptian Presidential Candidates on Twitter

My (future) colleagues at the Qatar Foundation’s Computing Research Institute (QCRI) have just launched a new platform that Al Jazeera is using to track the 2012 Egyptian Presidential Candidates on Twitter. Called Rayesna, which means “our president” in colloquial Egyptian Arabic, this fully automated platform uses cutting-edge Arabic computational linguistics processing developed by the Arabic Language Technology (ALT) group at QCRI.

“Through Rayesna, you can find out how many times a candidate is mentioned, which other candidate he is likely to appear with, and the most popular tweets for a candidate, with a special category for the most retweeted jokes about the candidates. The site also has a time-series to explore and compares the mentions of the candidate day-by-day. Caveats: 1. The site reflects only the people who choose to tweet, and this group may not be representative of general society; 2. Tweets often contain foul language and we do not perform any filtering.”
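At their core, the mention and co-occurrence statistics described above are simple aggregations over the tweet stream. Here is a toy sketch of that counting step; the candidate names and tweets are invented English stand-ins, and Rayesna's actual Arabic language processing (morphology, name variants, etc.) is far more sophisticated than substring matching:

```python
from collections import Counter
from itertools import combinations

# Hypothetical candidate list and tweets for illustration only
candidates = ["Morsi", "Shafik", "Sabahi"]
tweets = [
    "Morsi and Shafik face off in the runoff",
    "Sabahi supporters rally downtown",
    "Morsi leads in early counts",
]

mentions = Counter()     # how often each candidate is mentioned
co_mentions = Counter()  # which candidate pairs appear together

for tweet in tweets:
    present = [c for c in candidates if c in tweet]
    mentions.update(present)
    co_mentions.update(combinations(sorted(present), 2))

print(mentions.most_common())    # per-candidate mention counts
print(co_mentions.most_common()) # candidates likely to appear together
```

Binning the same counts by day would give the day-by-day time series the site exposes.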

I look forward to collaborating with the ALT group and exploring how their platform might also be used in the context of humanitarian response in the Arab World and beyond. There may also be important synergies with the work of the UN Global Pulse, particularly vis-a-vis their use of Twitter for real-time analysis of vulnerable communities.

Crisis Mapping Syria: Automated Data Mining and Crowdsourced Human Intelligence

The Syria Tracker Crisis Map is without doubt one of the most impressive crisis mapping projects yet. Launched just a few weeks after the protests began one year ago, the crisis map is spearheaded by just a handful of US-based Syrian activists who have meticulously and systematically documented 1,529 reports of human rights violations including a total of 11,147 killings. As recently reported in this NewScientist article, “Mapping the Human Cost of Syria’s Uprising,” the crisis map “could be the most accurate estimate yet of the death toll in Syria’s uprising […].” Their approach? “A combination of automated data mining and crowdsourced human intelligence,” which “could provide a powerful means to assess the human cost of wars and disasters.”

On the data-mining side, Syria Tracker has repurposed the HealthMap platform, which mines thousands of online sources for the purposes of disease detection and then maps the results, “giving public-health officials an easy way to monitor local disease conditions.” The customized version of this platform for Syria Tracker (ST), known as HealthMap Crisis, mines English information sources for evidence of human rights violations, such as killings, torture and detainment. As the ST Team notes, their data mining platform “draws from a broad range of sources to reduce reporting biases.” Between June 2011 and January 2012, for example, the platform collected over 43,000 news articles and blog posts from almost 2,000 English-based sources from around the world (including some pro-regime sources).

Syria Tracker combines the results of this sophisticated data mining approach with crowdsourced human intelligence, i.e., field-based eye-witness reports shared via webform, email, Twitter, Facebook, YouTube and voicemail. This naturally presents several important security issues, which explains why the main ST website includes an instructions page detailing security precautions that need to be taken while submitting reports from within Syria. They also link to this practical guide on how to protect your identity and security online and when using mobile phones. The guide is available in both English and Arabic.

Eye-witness reports are subsequently translated, geo-referenced, coded and verified by a group of volunteers who triangulate the information with other sources such as those provided by the HealthMap Crisis platform. They also filter the reports and remove duplicates. Reports that have a low confidence level vis-a-vis veracity are also removed. Volunteers use a vote-up/vote-down feature to “score” the veracity of eye-witness reports. Using this approach, the ST Team and their volunteers have been able to verify almost 90% of the documented killings mapped on their platform thanks to video and/or photographic evidence. They have also been able to associate specific names with about 88% of those reported killed by Syrian forces since the uprising began.

Depending on the levels of violence in Syria, the turn-around time for a report to be mapped on Syria Tracker is between one and three days. The team also produces weekly situation reports based on the data they’ve collected, along with detailed graphical analysis. KML files that can be uploaded and viewed using Google Earth are also made available on a regular basis. These provide “a more precisely geo-located tally of deaths per location.”

In sum, Syria Tracker is very much breaking new ground vis-a-vis crisis mapping. They’re combining automated data mining technology with crowdsourced eye-witness reports from Syria. In addition, they’ve been doing this for a year, which makes the project the longest-running crisis map I’ve seen in a hostile environment. Moreover, they’ve been able to sustain these important efforts with just a small team of volunteers. As for the veracity of the collected information, I know of no other public effort that has taken such a meticulous and rigorous approach to documenting the killings in Syria in near real-time. On February 24th, Al-Jazeera posted the following estimates:

Syrian Revolution Coordination Union: 9,073 deaths
Local Coordination Committees: 8,551 deaths
Syrian Observatory for Human Rights: 5,581 deaths

At the time, Syria Tracker had a total of 7,901 documented killings associated with specific names, dates and locations. While some duplicate reports may remain, the team argues that “missing records are a much bigger source of error.” Indeed, they believe that “the higher estimates are more likely, even if one chooses to disregard those reports that came in on some of the most violent days where names were not always recorded.”

The Syria Crisis Map itself has been viewed by visitors from 136 countries around the world and 2,018 cities—with the top 3 cities being Damascus, Washington DC and, interestingly, Riyadh, Saudi Arabia. The witnessing has thus been truly global and collective. When the Syrian regime falls, “the data may help subsequent governments hold him and other senior leaders to account,” writes the New Scientist. This was one of the principal motivations behind the launch of the Ushahidi platform in Kenya over four years ago. Syria Tracker is powered by Ushahidi’s cloud-based platform, Crowdmap. Finally, we know for a fact that the International Criminal Court (ICC) and Amnesty International (AI) closely followed the Libya Crisis Map last year.

Twitter, Crises and Early Detection: Why “Small Data” Still Matters

My colleagues John Brownstein and Rumi Chunara at Harvard University’s HealthMap project are continuing to break new ground in the field of Digital Disease Detection. Using data obtained from tweets and online news, the team was able to identify a cholera outbreak in Haiti weeks before health officials acknowledged the problem publicly. Meanwhile, my colleagues from UN Global Pulse partnered with Crimson Hexagon to forecast food prices in Indonesia by carrying out sentiment analysis of tweets. I had actually written this blog post on Crimson Hexagon four years ago to explore how the platform could be used for early warning purposes, so I’m thrilled to see this potential realized.

There is a lot that intrigues me about the work that HealthMap and Global Pulse are doing. But one point that really struck me vis-a-vis the former is just how little data was necessary to identify the outbreak. To be sure, not many Haitians are on Twitter and my impression is that most humanitarians have not really taken to Twitter either (I’m not sure about the Haitian Diaspora). This would suggest that accurate, early detection is possible even without Big Data; even with “Small Data” that is neither representative nor verified. (Interestingly, Rumi notes that the Haiti dataset is actually larger than datasets typically used for this kind of study).

In related news, a recent peer-reviewed study by the European Commission found that the spatial distribution of crowdsourced text messages (SMS) following the earthquake in Haiti was strongly correlated with building damage. Again, the dataset of text messages was relatively small. And again, this data was neither collected using random sampling (i.e., it was crowdsourced) nor verified for accuracy. Yet the analysis of this small dataset still yielded some particularly interesting findings that have important implications for rapid damage detection in post-emergency contexts.

While I’m no expert in econometrics, what these studies suggest to me is that detecting change over time is ultimately more critical than having a large-N dataset, let alone one that is obtained via random sampling or even vetted for quality control purposes. That doesn’t mean that the latter factors are not important; it simply means that the outcome of the analysis is relatively less sensitive to these specific variables. Changes in the baseline volume/location of tweets on a given topic appear to be strongly correlated with offline dynamics.
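To make "detecting change over time" concrete, here is a minimal sketch of baseline change detection on daily message counts. The counts, window size, and z-score threshold are all invented for illustration; a real pipeline would also control for weekly seasonality, topic drift, and platform growth:

```python
import statistics

def detect_spikes(daily_counts, window=7, z_threshold=3.0):
    """Flag days whose count deviates sharply from the trailing baseline,
    measured as a z-score against the previous `window` days."""
    spikes = []
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1.0  # avoid division by zero
        z = (daily_counts[day] - mean) / stdev
        if z >= z_threshold:
            spikes.append(day)
    return spikes

# Invented daily tweet counts on a topic; day 7 is a sudden surge
counts = [20, 22, 19, 21, 20, 23, 21, 95, 24, 22]
print(detect_spikes(counts))
```

Note that nothing here requires the counts to come from a representative sample: the signal is the deviation from the series' own baseline, which is exactly why even "Small Data" can carry it.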

What are the implications for crowdsourced crisis maps and disaster response? Could similar statistical analyses be carried out on Crowdmap data, for example? How small can a dataset be and still yield actionable findings like those mentioned in this blog post?

How Crisis Mapping Proved Henry Kissinger Wrong in Cambodia

Crisis Mapping can reveal insights on current crises as well as crises from decades ago. Take Dr. Jen Ziemke’s dissertation research on crisis mapping the Angolan civil war, which revealed and explained patterns of violence against civilians. My colleague Dr. Taylor Owen recently shared with me his fascinating research, which comprises a spatio-historical analysis of the US bombardment of Cambodia. Like Jen’s research, Taylor’s clearly shows how crisis mapping can shed new light on important historical events.


Taylor analyzed a recently declassified Pentagon geo-referenced data set of all US bombings during the Indo-Chinese war, which revealed substantial errors in the historical record of what happened to Cambodia between 1965 and 1973. The spatial and temporal analysis also adds more food for thought regarding the link between the rise of the Khmer Rouge and American air strikes. In particular, Owen’s analysis shows that:

“… the total tonnage dropped on Cambodia was five times greater than previously known; the bombing inside Cambodia began nearly 4 years prior to the supposed start of the Menu Campaign, under the Johnson Administration; that, in contradiction to Henry Kissinger’s claims, and over the warning of the Joint Chiefs of Staff, Base Areas 704, 354 and 707 were all heavily bombed; the bombing intensity increased throughout the summer of 1973, after Congress barred any such increase; and, that despite claims by both Kissinger and Nixon to the contrary, there was substantial bombing within 1km of inhabited villages.”

To be sure, the crisis mapping analysis of Cambodia “transforms our understanding of the scale of what happened to Cambodia during the Indochinese war.” The total tonnage of bombs dropped on the country had previously been pegged at some 500,000 tons. The new analysis dramatically revises this figure upwards to “2,756,941 tons of US bombs dropped during no fewer than 230,516 sorties.” To put this figure into context, more bombs were dropped on Cambodia than the number of bombs that the US dropped during all of World War II. Cambodia remains the most heavily bombed country in the world.

Kissinger had claimed that no bombs were being dropped on villages. He gave assurances, in writing, that no bombs would be dropped “closer than 1 km from villages, hamlets, houses, monuments, temples, pagodas or holy places.” As Owen reveals, “the absurdity of Kissinger’s claim is clearly demonstrated” by the crisis mapping analysis below in which the triangles represent village centers and the red points denote bombing targets, often hit with multiple sorties.

Owen argues that “while the villagers may well have hated the Viet Cong, in many cases once their villages had been bombed, they would become more sympathetic to the Khmer Rouge,” hence the supposed link between the eventual Cambodian genocide which killed 1.7 million people (~21% of the population) and the US bombing. To be sure, “the civilian casualties caused by the bombing significantly increased the recruiting capacity of the Khmer Rouge, whom over the course of the bombing campaign transformed from a small agrarian revolutionary group, to a large anti-imperial army capable of taking over the country.”

In sum, the crisis mapping analysis of Cambodia “challenges both the established historical narrative on the scale and scope of this campaign, as well as our understanding of the effects of large scale aerial bombardment.”

Applying Earthquake Physics to Conflict Analysis

I really enjoyed speaking with Captain Wayne Porter whilst at PopTech 2011 last week. We both share a passion for applying insights from complexity science to different disciplines. I’ve long found the analogies between earthquakes and conflicts intriguing. We often talk of geopolitical fault lines, mounting tensions and social stress. “If this sounds at all like the processes at work in the Earth’s crust, where stresses build up slowly to be released in sudden earthquakes … it may be no coincidence” (Buchanan 2001).

To be sure, violent conflict is “often like an earthquake: it’s caused by the slow accumulation of deep and largely unseen pressures beneath the surface of our day-to-day affairs. At some point these pressures release their accumulated energy with catastrophic effect, creating shock waves that pulverize our habitual and often rigid ways of doing things…” (Homer-Dixon 2006).

But are foreshocks and aftershocks really discernible in social systems as well? Like earthquakes, both inter-state and internal wars actually occur with the same statistical pattern (see my previous blog post on this). Since earthquakes and conflicts are complex systems, they also exhibit emergent features associated with critical states. In sum, “the science of earthquakes […] can help us understand sharp and sudden changes in types of complex systems that aren’t geological–including societies…” (Homer-Dixon 2006).

Back in 2006, I collaborated with Professor Didier Sornette and Dr. Ryan Woodard from the Swiss Federal Institute of Technology (ETHZ) to assess whether a mathematical technique developed for earthquake prediction might shed light on conflict dynamics. I presented this study along with our findings at the American Political Science Association (APSA) convention last year (PDF). This geophysics technique, “superposed epoch analysis,” is used to identify statistical signatures before and after earthquakes. In other words, this technique allows us to determine whether any patterns are discernible in the data during foreshocks and aftershocks. Earthquake physicists work from global spatial time series data of seismic events to develop models for earthquake prediction. We used a global time series dataset of conflict events generated from newswires over a 15-year period. The graph below explains the “superposed epoch analysis” technique as applied to conflict data.

[Figure: superposed epoch analysis applied to a conflict-event time series]

The curve above represents a time series of conflict events (frequency) over a particular period of time. We select an arbitrary threshold, such as “threshold A” denoted by the dotted line. Every peak that crosses this threshold is then “copied” and “pasted” into a new graph. That is, the peak, together with the data points 25 days prior to and following the peak, is selected.

The peaks in the new graph are then superimposed and aligned such that the peaks overlap precisely. With “threshold A”, two events cross the threshold; five do for “threshold B”. We then vary the thresholds to look for consistent behavior and examine the statistical behavior of the 25 days before and after the “extreme” conflict event. For this study, we performed the computational technique described above on the conflict data for the US, UK, Afghanistan, Colombia and Iraq.
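The copy-paste-superimpose procedure described above can be sketched in a few lines of code. This is a minimal illustration on synthetic data with an arbitrary threshold, not our actual dataset or implementation:

```python
import statistics

def superposed_epoch(series, threshold, window=25):
    """Superposed epoch analysis: find peaks that cross the threshold,
    cut out +/- `window` days around each, align the cutouts on the
    peak, and average them point-by-point at each offset."""
    epochs = []
    for t in range(window, len(series) - window):
        # a peak: crosses the threshold and is a local maximum
        if series[t] >= threshold and series[t] == max(series[t - 1:t + 2]):
            epochs.append(series[t - window:t + window + 1])
    if not epochs:
        return []
    # average the superposed epochs at each offset from the peak
    return [statistics.mean(values) for values in zip(*epochs)]

# Synthetic daily conflict-event counts with two extreme peaks
series = [5] * 30 + [40] + [5] * 40 + [60] + [5] * 30
profile = superposed_epoch(series, threshold=30)
print(len(profile))  # 2 * window + 1 aligned offsets
print(profile[25])   # mean intensity at the aligned peaks
```

The averaged profile is what gets inspected for foreshock and aftershock signatures: any systematic rise before offset 25 or decay after it that survives varying the threshold.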

[Figures: foreshock and aftershock profiles for the analyzed countries]

The foreshock and aftershock behaviors in Iraq and Afghanistan appear to be similar. Is this because the conflicts in both countries were the result of external intervention, i.e., invasion by US forces (exogenous shock)?

In the case of Colombia, an internal low-intensity and protracted conflict, the statistical behavior of foreshocks and aftershocks is visibly different from those of Iraq and Afghanistan. Do the different statistical behaviors point to specific signatures associated with exogenous and endogenous causes of extreme events? Does one set of behavior contrast with another in the same way that old wars and new wars differ?

Are certain extreme events endogenous or exogenous in nature? Can endogenous or exogenous signatures be identified? In other words, are extreme events just part of the fat tail of a power law due to self-organized criticality (endogeneity)? Or is catastrophism in action, whereby extreme events require extreme causes outside the system (exogeneity)?

Another possibility still is that extreme events are the product of both endogenous and exogenous effects. How would this dynamic unfold? To answer these questions, we need to go beyond political science. The distinction between responses to endogenous and exogenous processes is a fundamental property of physics and is quantified as the fluctuation-dissipation theorem in statistical mechanics. This theory has been successfully applied to social systems (such as book sales) as a way to help understand different classes of causes and effects.

Questions for future research: Do conflicts among actors in social systems display measurable endogenous and exogenous behavior? If so, can a quantitative signature of precursory (endogenous) behavior be used to help recognize and then reduce growing conflict? The next phase of this research will be to apply the above techniques to the conflict dataset already used to examine the statistical behavior of foreshocks and aftershocks.

Help Crowdsource Satellite Imagery Analysis for Syria: Building a Library of Evidence

Update: Project featured on UK Guardian Blog! Also, for the latest on the project, please see this blog post.

This blog post follows from this previous one: “Syria – Crowdsourcing Satellite Imagery Analysis to Identify Mass Human Rights Violations.” As part of the first phase of this project, we are building a library of satellite images for features we want to tag using crowdsourcing.

In particular, we are looking to identify the following evidence using high-resolution satellite imagery:

  • Large military equipment
  • Large crowds
  • Checkpoints
The idea is to provide volunteers from the Standby Volunteer Task Force (SBTF) Satellite Team with as much of a road map as possible so they know exactly what they’re looking for in the satellite imagery they’ll be tagging using the Tomnod system:

Here are some of the pictures we’ve been able to identify thanks to the help of my good colleague Christopher Albon:
I’ve placed these and other examples in this Google Doc, which is open for comment. We need your help to provide us with other imagery depicting heavy Syrian military equipment, large crowds and checkpoints. Please provide links and screenshots of such imagery in this open and editable Google Doc.

Here are some of the links that Chris already sent us for the above imagery: