Category Archives: Crisis Mapping

Part 4: Automated Analysis and Uncertainty Visualized

This is Part 4 of 7 of the highlights from “Illuminating the Path: The Research and Development Agenda for Visual Analytics.” Please see this post for an introduction to the study and access to the other 6 parts.

As data flooding increases, the human eye may have difficulty focusing on patterns. For this reason, VA systems should have “semi-automated analytic engines and user-driven interfaces.” Indeed, “an ideal environment for analysis would have a seamless integration of computational and visual techniques.”

For example, “the visual overview may be based on some preliminary data transformations […]. Interactive focusing, selecting, and filtering could be used to isolate data associated with a hypothesis, which could then be passed to an analysis engine with informed parameter settings. Results could be superimposed on the original information to show the difference between the raw data and the computed model, with errors highlighted visually.”
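To make that workflow concrete, here is a minimal sketch in Python (my own illustration with made-up numbers, not code from the study) of the filter-analyze-superimpose loop described above: a subset of the data is selected, a simple model is fitted to it, and the observations that deviate most from the model are flagged for visual highlighting.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(100)                           # e.g. days since the start of a crisis
raw = 0.5 * t + rng.normal(0, 5, size=100)   # hypothetical raw observations

# "Interactive focusing, selecting, and filtering": here, a simple time window
mask = (t >= 20) & (t < 80)
t_sel, y_sel = t[mask], raw[mask]

# "Analysis engine with informed parameter settings": a least-squares trend fit
slope, intercept = np.polyfit(t_sel, y_sel, deg=1)
model = slope * t + intercept

# "Results superimposed on the original information, with errors highlighted":
# flag observations that deviate from the model by more than two standard deviations
residuals = raw - model
flagged = np.abs(residuals) > 2 * residuals.std()
print(f"fitted trend: y = {slope:.2f}*t + {intercept:.2f}; points flagged: {flagged.sum()}")
```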

Yet current mathematical techniques “for representing pattern and structure, as well as visualizing correlations, time patterns, metadata relationships, and networks of linked information” do not extend well to harder problems: “for more complex reasoning tasks—particularly temporal reasoning and combined time and space reasoning […], much work remains to be done.” Furthermore, “existing techniques also fail when faced with the massive scale, rapidly changing data, and variety of information types we expect for visual analytics tasks.”

In addition, “the complexity of this problem will require algorithmic advances to address the establishment and maintenance of uncertainty measures at varying levels of data abstraction.” There is presently “no accepted methodology to represent potentially erroneous information, such as varying precision, error, conflicting evidence, or incomplete information.”

To this end, “interactive visualization methods are needed that allow users to see what is missing, what is known, what is unknown, and what is conjectured, so that they may infer possible alternative explanations.”

In sum, “uncertainty must be displayed if it is to be reasoned with and incorporated into the visual analytics process. In existing visualizations, much of the information is displayed as if it were true.”
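As a small illustration of what “displaying uncertainty” might look like in practice, here is a hedged sketch (made-up numbers, matplotlib as one possible plotting library) that distinguishes precise from imprecise values and confirmed from conjectured ones, rather than plotting everything as if it were true.

```python
# A minimal sketch, using entirely hypothetical report counts.
import matplotlib.pyplot as plt

days      = [1, 2, 3, 4, 5]
reports   = [12, 18, 15, 22, 30]       # hypothetical event counts per day
precision = [1, 2, 6, 2, 10]           # +/- uncertainty attached to each count
confirmed = [True, True, False, True, False]

fig, ax = plt.subplots()
for d, r, p, ok in zip(days, reports, precision, confirmed):
    ax.errorbar(d, r, yerr=p,
                fmt="o" if ok else "s",            # square marker = conjectured
                color="black" if ok else "gray",
                capsize=3)
ax.set_xlabel("day")
ax.set_ylabel("reported events (with uncertainty)")
plt.show()
```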

Patrick Philippe Meier

Part 3: Data Tetris and Information Synthesis

This is Part 3 of 7 of the highlights from “Illuminating the Path: The Research and Development Agenda for Visual Analytics.” Please see this post for an introduction to the study and access to the other 6 parts.

Visual Analytics (VA) tools need to integrate and visualize different data types. But the integration of this data needs to be “based on their meaning rather than the original data type” in order to “facilitate knowledge discovery through information synthesis.” However, “many existing visual analytics systems are data-type-centric. That is, they focus on a particular type of data […].”

We know that different types of data are regularly required to conduct solid analysis, so developing a data synthesis capability is particularly important. This means the ability to “bring data of different types together in a single environment […] to concentrate on the meaning of the data rather than on the form in which it was originally packaged.”

To be sure, information synthesis needs to “extend beyond the current data-type modes of analysis to permit the analyst to consider dynamic information of all types in a seamless environment.” So we need to “eliminate the artificial constraints imposed by data type so that we can aid the analyst in reaching deeper analytical insight.”
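Here is a rough sketch of what meaning-based synthesis could look like in code, using pandas and entirely hypothetical records: an SMS, a news article and a geotagged photo are mapped into one common what/when/where schema so the analyst reasons about events rather than file formats.

```python
import pandas as pd

# Three hypothetical incoming records of different types
sms     = {"text": "clashes near market", "sent": "2009-01-10", "cell": "Juba"}
article = {"headline": "Protest reported", "published": "2009-01-11", "dateline": "Khartoum"}
photo   = {"caption": "burnt vehicles", "taken": "2009-01-12", "lat": 4.85, "lon": 31.6}

# Map each record into a single meaning-centric event schema
events = pd.DataFrame([
    {"what": sms["text"],         "when": sms["sent"],         "where": sms["cell"],          "source": "sms"},
    {"what": article["headline"], "when": article["published"], "where": article["dateline"],  "source": "news"},
    {"what": photo["caption"],    "when": photo["taken"],       "where": f'{photo["lat"]},{photo["lon"]}', "source": "photo"},
])
events["when"] = pd.to_datetime(events["when"])
print(events.sort_values("when"))
```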

To this end, we need breakthroughs in “automatic or semi-automatic approaches for identifying [and coding] content of imagery and video data.” A semi-automatic approach could draw on crowdsourcing, much like Ushahidi‘s Swift River.

In other words, we need to develop visual analytical tools that do not force the analyst to “perceptually and cognitively integrate multiple elements. […] Systems that force a user to view sequence after sequence of information are time-consuming and error-prone.” New techniques are also needed to do away with the separation of “what I want and the act of doing it.”

Patrick Philippe Meier

Armed Conflict and Location Event Dataset (ACLED)

I joined the Peace Research Institute, Oslo (PRIO) as a researcher in 2006 to do some data development work on a conflict dataset and to work with Norway’s former Secretary of State on assessing the impact of armed conflict on women’s health for the Ministry of Foreign Affairs (MFA).

I quickly became interested in a related PRIO project that had recently begun, the “Armed Conflict and Location Event Dataset,” or ACLED. I had already worked with conflict event-datasets as part of operational conflict early warning systems in the Horn, so the project immediately appealed to me.

While I have referred to ACLED in a number of previous blog posts, two of my main criticisms (until recently) were (1) the lack of data on recent conflicts; and (2) the lack of an interactive interface for geospatial analysis, or at least a more compelling visualization platform.

Introducing SpatialKey

Independently, I came across UniversalMind back in November of last year when Andrew Turner at GeoCommons made a reference to the group’s work in his presentation at an Ushahidi meeting. I featured one of the group’s products, SpatialKey, in my recent video primer on crisis mapping.

As it turns out, ACLED is now using SpatialKey to visualize and analyze some of its data. So the team has definitely come a long way from using ArcGIS and Google Earth, which is great. The screenshot below, for example, depicts the ACLED data on Kenya’s post-election violence using SpatialKey.

ACLEDspatialkey

If the Kenya data is not drawn from Ushahidi, then this could be an exciting research opportunity to compare both datasets using visual analysis and applied geo-statistics. I write “if” because, somewhat surprisingly, PRIO has not made the Kenya data available. They are usually very transparent, so I will follow up with them and hope to get the data. Anyone interested in co-authoring this study?

Academics Get Up to Speed

It’s great to see ACLED developing conflict data for more recent conflicts. Data on Chad, Sudan and the Central African Republic (CAR) is also depicted using SpatialKey but, regrettably, the underlying spreadsheet data again does not appear to be available. If the data were public, then the UN’s Threat and Risk Mapping Analysis (TRMA) project may very well have much to gain from using the data operationally.

ACLEDspatialkey2

Data Hugging Disorder

I’ll close with just one—perhaps unwarranted—concern since I still haven’t heard back from ACLED about accessing their data. As academics become increasingly interested in applying geospatial analysis to recent or even current conflicts by developing their own datasets (a very positive move for sure), will they nevertheless keep their data to themselves until they’ve published an article in a peer-reviewed journal, a process that can take a year or more?

To this end I share the concern that my colleague Ed Jezierski from InSTEDD articulated in his excellent blog post yesterday: “Academic projects that collect data with preference towards information that will help to publish a paper rather than the information that will be the most actionable or help community health the most.” Worse still, however, would be academics collecting data highly relevant to the humanitarian or human rights community and not sharing that data until their academic papers are officially published.

I don’t think there needs to be competition between scholars and like-minded practitioners. There are increasingly more scholar-practitioners who recognize that they can contribute their research and skills to the benefit of the humanitarian and human rights communities. At the same time, the currency of academia remains the number of peer-reviewed publications. But humanitarian practitioners can simply sign an agreement stating that anyone using the data for humanitarian purposes cannot publish any analysis of said data in a peer-reviewed forum.

Thoughts?

Patrick Philippe Meier

Part 2: Data Flooding and Platform Scarcity

This is Part 2 of 7 of the highlights from “Illuminating the Path: The Research and Development Agenda for Visual Analytics.” Please see this post for an introduction to the study and access to the other 6 parts.

Data Flooding

Data flooding is a term I use to illustrate the fact that “our ability to collect data is increasing at a faster rate than our ability to analyze it.” To this end, I completely agree with the recommendation that new methods are required to “allow the analyst to examine this massive, multi-dimensional, multi-source, time-varying information stream to make decisions in a time-critical manner.”

We don’t want less information but rather more, since “large data volumes allow analysts to discover more complete information about a situation.” To be sure, “scale brings opportunities as well.” As a result, for example, “analysts may be able to determine more easily when expected information is missing,” which sometimes “offers important clues […].”

However, while computer processing power and memory density have changed radically over the decades, “basic human skills and abilities do not change significantly over time.” Technological advances can certainly leverage our skills “but there are fundamental limits that we are asymptotically approaching,” hence the notion of information glut.

In other words, “human skills and abilities do not scale.” That said, the number of humans involved in analytical problem-solving does scale. Unfortunately, however, “most published techniques for supporting analysis are targeted for a single user at a time.” This means that new techniques that “gracefully scale from a single user to a collaborative (multi-user) environment” need to be developed.

Platform Scarcity

However, current technologies and platforms being used in the humanitarian and human rights communities do not address the needs for handling ever-changing volumes of information. “Furthermore, current tools provide very little in the way of support for the complex tasks of analysis and discovery process.” There clearly is a platform scarcity.

Admittedly, “creating effective visual representations is a labor-intensive process that requires a solid understanding of the visualization pipeline, characteristics of the data to be displayed, and the tasks to be performed.”

However, as is clear from the crisis mapping projects I have consulted on, “most visualization software is written with incomplete knowledge of at least some of this information.” Indeed, it is rarely possible for “the analyst, who has the best understanding of the data and task, to construct new tools.”

The NVAC study thus recommends that “research is needed to create software that supports the most complex and time-consuming portions of the analytical process, so that analysts can respond to increasingly more complex questions.” To be sure, “we need real-time analytical monitoring that can alert first responders to unusual situations in advance.”

Patrick Philippe Meier

Part 1: Visual Analytics

This is Part 1 of 7 of the highlights from “Illuminating the Path: The Research and Development Agenda for Visual Analytics.” Please see this post for an introduction to the study and access to the other 6 parts.

NVAC defines Visual Analytics (VA) as “the science of analytical reasoning facilitated by interactive visual interfaces. People use VA tools and techniques to synthesize information and derive insights from massive, dynamic, ambiguous, and often conflicting data; detect the expected and discover the unexpected; provide timely, defensible, and understandable assessments; and communicate assessment effectively for action.”

The field of VA is necessarily multidisciplinary and combines “techniques from information visualization with techniques from computational transformation and analysis of data.” VA includes the following focus areas:

  • Analytical reasoning techniques, “that enable users to obtain deep insights that directly support assessment, planning and decision-making”;
  • Visual representations and interaction techniques, “that take advantage of the human eye’s broad bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once”;
  • Data representation and transformations, “that convert all types of conflicting and dynamic data in ways that support visualization and analysis”;
  • Production, presentation and dissemination techniques, “to communicate information in the appropriate context to a variety of audiences.”

As is well known, “the human mind can understand complex information received through visual channels.” The goal of VA is thus to facilitate the analytical reasoning process “through the creation of software that maximizes human capacity to perceive, understand, and reason about complex and dynamic situations.”

In sum, “the goal is to facilitate high-quality human judgment with a limited investment of the analysts’ time.” This means, in part, to “expose all relevant data in a way that facilitates the reasoning process to enable action.” To be sure, solving a problem often means representing it so that the solution is more obvious (adapted from Herbert Simon). “Sometimes, the simple act of placing information on a timeline or a map can generate clarity and profound insight.” Indeed, both “temporal relationships and spatial patterns can be revealed through timelines and maps.”

VA also reduces the costs associated with sense-making in two primary ways, by:

  1. Transforming information into forms that allow humans to offload cognition onto easier perceptual processes;
  2. Allowing software agents to do some of the filtering, representation translation, interpretation, and even reasoning.

That said, we should keep in mind that “human-designed visualizations are still much better than those created by our information visualization systems.” That is, there are more “highly evolved and widely used metaphors created by human information designers” than there are “successful new computer-mediated visual representations.”

Patrick Philippe Meier

Research Agenda for Visual Analytics

I just finished reading “Illuminating the Path: The Research and Development Agenda for Visual Analytics.” The National Visualization and Analytics Center (NVAC) published the 200-page book in 2004 and the volume is absolutely one of the best treatises I’ve come across on the topic yet. The purpose of the series of posts that follow is to share some highlights and excerpts relevant for crisis mapping.

NVACcover

Co-edited by James Thomas and Kristin Cook, the book focuses specifically on homeland security, but there are numerous insights to be gained on how “visual analytics” can also illuminate the path for crisis mapping analytics. Recall that the field of conflict early warning originated in part from World War II and the lack of warning before the attack on Pearl Harbor.

Several coordinated systems for the early detection of a Soviet bomber attack on North America were set up in the early days of the Cold War. The Distant Early Warning Line, or Dew Line, was the most sophisticated of these. The point to keep in mind is that the national security establishment is often in the lead when it comes to initiatives that can also be applied for humanitarian purposes.

The motivation behind the launching of NVAC and this study was 9/11. In my opinion, this volume goes a long way toward validating the field of crisis mapping. I highly recommend it to colleagues in both the humanitarian and human rights communities. In fact, the book is directly relevant to my current consulting work with the UN’s Threat and Risk Mapping Analysis (TRMA) project in the Sudan.

So this week, iRevolution will be dedicated to sharing daily highlights from the NVAC study. Taken together, these posts will provide a good summary of the rich and in-depth 200-page study. Check back on this post for live links to NVAC highlights:

Part 1: Visual Analytics

Part 2: Data Flooding and Platform Scarcity

Part 3: Data Tetris and Information Synthesis

Part 4: Automated Analysis and Uncertainty Visualized

Part 5: Data Visualization and Interactive Interface Design

Part 6: Mobile Technologies and Collaborative Analytics

Part 7: Towards a Taxonomy of Visual Analytics

Note that the sequence above does not correspond to specific individual chapters in the NVAC study. This structure for the summary is what made most sense.

Patrick Philippe Meier

UN Sudan Information Management Working Group

I’m back in the Sudan to continue my work with the UNDP’s Threat and Risk Mapping Analysis (TRMA) project. UN agencies typically suffer from what a colleague calls “Data Hugging Disorder (DHD),” i.e., they rarely share data. This is generally the rule, not the exception.

UN Exception

There is an exception, however: the UN’s recently established Information Management Working Group (IMWG) in the Sudan. The general goal of the IMWG is to “facilitate the development of a coherent information management approach for the UN Agencies and INGOs in Sudan in close cooperation with local authorities and institutions.”

More specifically, the IMWG seeks to:

  1. Support and advise the UNDAF Technical Working Groups and Work Plan sectors in the accessing and utilization of available data for improved development planning and programming;
  2. Develop, or advise on the development of, a Sudan-specific tool or set of tools to support decentralized information-sharing and common GIS mapping, in such a way that it will be consistent with the DevInfo system development and can eventually be adopted/integrated as a standard plug-in for the same.

To accomplish these goals, the IMWG will collectively assume a number of responsibilities including the following:

  • Agree on information sharing protocols, including modalities of shared information update;
  • Review current information management mechanisms to have a coherent approach.

The core members of the working group include: IOM, WHO, FAO, UNICEF, UNHCR, UNFPA, WFP, OCHA and UNDP.

Information Sharing Protocol

These members recently signed and endorsed an “Information Sharing Protocol”. The protocol sets out the preconditions, the responsibilities and the rights of the IMWG members for sharing, updating and accessing the data of the information providers.

With this protocol, each member commits to sharing specific datasets, in specific formats and at specific intervals. The data provided is classified as either public access or classified access. The latter is further disaggregated into three categories:

  1. UN partners only;
  2. IMWG members only;
  3. [Agency/group] only.

There is also a restricted access category, which is granted on a case-by-case basis only.

UNDP/TRMA’s Role

UNDP’s role (via TRMA) in the IMWG is to technically support the administration of the information-sharing between IMWG members. More specifically, UNDP will provide ongoing technical support for the development and upgrading of the IMWG database tool in accordance with the needs of the Working Group.

In addition, UNDP’s role is to receive data updates, to update the IMWG tool and to circulate data according to classification of access as determined by individual contributing agencies. Might a more seamless information-sharing approach work, one in which UNDP does not have to be the repository of the data, let alone manually update the information?

In any case, the very existence of a UN Information Management Working Group in the Sudan suggests that Data Hugging Disorders (DHDs) can be cured.

Patrick Philippe Meier

GeoSurveillance for Crisis Mapping Analytics

Having blogged at length on the rationale for Crisis Mapping Analytics (CMA), I am now interested in assessing the applicability of existing tools for crisis mapping vis-a-vis complex humanitarian emergencies.

In this blog post, I review an open-source software package called GeoSurveillance that combines spatial statistical techniques and GIS routines to perform tests for the detection and monitoring of spatial clustering.

The post is based on the new peer-reviewed article “GeoSurveillance: a GIS-based system for the detection and monitoring of spatial clusters” published in the Journal of Geographical Systems and authored by Ikuho Yamada, Peter Rogerson and Gyoungju Lee.

Introduction

The detection of spatial clusters—testing the null hypothesis of spatial randomness—is a key focus of spatial analysis. My first research project in this area dates back to 1996, when I wrote a software algorithm in C++ to determine the randomness (or non-randomness) of stellar distributions.

stars

The program would read a graphics file of a high-quality black-and-white image of a stellar distribution (that I had scanned from a rather expensive book) and run a pattern analysis procedure to determine what constituted a star and then detect them. Note that the stars were of various sizes and resolutions, with many overlapping in part.

Once the stars were detected, I manually approximated the number of stars in the stellar distributions to evaluate the reliability of my algorithm. The program would then assign (x, y) coordinates to each star. I compared this series of numbers with a series of pseudo-random numbers that I generated independently.

Using the Kolmogorov-Smirnov test in two dimensions, I could then test the probability that the series of (x, y) coordinates and the pseudo-random numbers were samples drawn from the same distribution.
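For readers who want to try something similar today, here is a rough approximation of that test in Python. SciPy does not ship a two-dimensional Kolmogorov-Smirnov test, so the sketch below simply tests each coordinate axis against a uniform distribution, which is a weaker, one-dimensional stand-in for the full 2D procedure (the point positions are randomly generated stand-ins, not real star data).

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
stars = rng.uniform(0, 1, size=(500, 2))   # stand-in for detected (x, y) star positions

# Points scattered completely at random on the unit square have uniform x and y marginals,
# so departures from uniformity on either axis hint at non-random spatial structure.
for axis, name in [(0, "x"), (1, "y")]:
    stat, p = kstest(stars[:, axis], "uniform")
    print(f"{name}-coordinates: KS statistic = {stat:.3f}, p-value = {p:.3f}")
```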

Retrospective vs Prospective Analysis

This type of spatial cluster analysis on stellar distributions is retrospective and the majority of methods developed to date belong to this class of tests.

The other class of spatial cluster detection is called prospective testing. This testing is designed for time-series data that is updated over time and test statistics are computed when new data becomes available. “While retrospective tests focus on a static aspect of spatial patterns, prospective tests take into account their dynamic nature and attempt to find new, emergent clusters as quickly as possible.”

There has been a surge of interest in this prospective approach following the anthrax attacks of 2001 and the perceived threat of bioterrorism since. But as the authors of the GeoSurveillance study note, prospective monitoring approaches have broader application, “including the detection of outbreaks of food poisoning and infectious diseases and the detection of emergent crime hotspots.” And I would add crisis mapping for complex humanitarian emergencies.

Very little work has been done using retrospective analysis for crisis mapping and even less using prospective techniques. Both are equally important. The former is critical if we want to have a basis (and indeed a baseline) to know what deviations and patterns to look for. The latter is important because, as humanitarian practitioners and policy makers, we are interested in operational conflict prevention.

Spatial Analysis Software

While several GIS software packages provide functionalities for retrospective analysis of spatial patterns, “few provide for prospective analysis,” with the notable exception of SaTScan, which enables both applications. SaTScan does have two drawbacks, however.

The first is that “prospective analysis in SaTScan is not adjusted in a statistically rigorous manner for repeated time-periodic tests conducted as new data become available.” Secondly, the platform “does not offer any GIS functionality for quick visual assessment of detected clusters.”

What is needed is a platform that provides a convenient graphical user-interface (GUI) that allows users to identify spatial clusters both statistically and visually. GeoSurveillance seeks to do just this.

Introducing GeoSurveillance

This spatial analysis software consists of three components: a cluster detection and monitoring component, a GIS component and a support tool component as depicted below.

GeoSurveillance

  • “The cluster detection and monitoring component is further divided into retrospective and prospective analysis tools, each of which has a corresponding user-interface where parameters and options for the analysis are to be set. When the analysis is completed, the user-interfaces also provide a textual and/or graphical summary of results.”
  • “The GIS component generates map representation of the results, where basic GIS functionalities such as zoom in/out, pan, and identify are available. For prospective analysis, the resulting map representation is updated every time a statistical computation for a time unit is completed so that spatial patterns changing over time can be visually assessed as animation.”
  • “The support tool component provides various auxiliary tools for user.”

The table below presents a summary (albeit not exhaustive) of statistical tests for cluster detection. The methods labeled in bold are currently available within GeoSurveillance.

GeoSurveillance2

GeoSurveillance uses the local score statistic for retrospective analysis and applies the univariate cumulative sum (cusum) method. Cusum methods are familiar to public health professionals since they are often applied to public health monitoring.

Both methods are somewhat involved, mathematically speaking, so I won’t elaborate on them here. Suffice it to say that the complexity of spatial analysis techniques needs to be “hidden” from the average user if this kind of platform is to be used by humanitarian practitioners in the field.
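For the curious, though, the core of the cusum idea fits in a few lines. The sketch below is a generic one-sided univariate cusum with illustrative parameter values and made-up counts; it is not code from GeoSurveillance itself.

```python
def cusum(observations, target=0.0, allowance=0.5, threshold=4.0):
    """Return the cusum path and the index of the first alarm, if any."""
    s, path, alarm = 0.0, [], None
    for i, x in enumerate(observations):
        # accumulate only upward deviations beyond the allowance k
        s = max(0.0, s + (x - target) - allowance)
        path.append(s)
        if alarm is None and s > threshold:
            alarm = i
    return path, alarm

counts = [0.2, -0.1, 0.4, 0.3, 1.8, 2.1, 1.9, 2.5]   # hypothetical standardized counts
path, alarm = cusum(counts)
print("cusum path:", [round(v, 2) for v in path], "| first alarm at index:", alarm)
```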

Applying GeoSurveillance

The authors, Yamada et al., used the platform to carry out a particularly interesting study of low birth weight (LBW) incidence data in Los Angeles, California.

Traditional studies “on LBW have focused on individual-level risk factors such as race/ethnicity, maternal age, maternal education, use of prenatal care, smoking and other substance abuse during pregnancy.” However, such individual factors have had little ability to explain the risk of LBW. To this end, “increasing attention has been directed to neighborhood-level risk factors including […] racial/ethnic composition, economic status, crime rate, and population growth trend.”

The authors of the GeoSurveillance study thus hypothesize that “the risk of LBW incidence and its change over time have non-random spatial patterns reflecting background distributions of neighborhood-level risk factors.” The results of the retrospective and prospective analysis using GeoSurveillance are available in both tabular and map formats. The latter format is displayed and interpreted below.

GeoSurveillance3

Using GeoSurveillance’s retrospective analysis functionality enabled the authors to automatically detect high-risk areas of LBW (marked in red) as well as the zone with the highest abnormal incidence of LBW (marked in yellow). The maps above indicate that a large concentration of neighborhoods with high risk of LBW is found “near downtown Los Angeles extending toward the northwest, and three smaller ones in the eastern part of the county.”

GeoSurveillance4

Carrying out prospective analysis on the LBW data enabled the authors to conclude that the risk of LBW “used to be concentrated in particular parts of the county but is now more broadly spread throughout the county.” This result now provides the basis for further investigation to “identify individual- and neighborhood-level factors that relate to this change in the spatial distribution of the LBW risk.”

Conclusion

The developers of GeoSurveillance plan to implement more methods in the next version, especially for prospective analysis given the limited availability of such methods in other GIS software. The GeoSurveillance software as well as associated documentation and sample datasets can be downloaded here.

I have downloaded the software myself and will start experimenting shortly with some Ushahidi and/or PRIO data if possible. Stay tuned for an update.

Patrick Philippe Meier

Ushahidi for Mobile Banking

I just participated in a high-level mobile banking (mBanking) conference in Nairobi, which I co-organized with colleagues from The Fletcher School.

Participants included the Governor of Kenya’s Central Bank, Kenya’s Finance Minister, the directors/CEOs of Safaricom, Equity Bank, Bankable Frontier Associates, Iris Wireless, etc., and senior representatives from the Central Banks of Tanzania, Rwanda and Burundi as well as CGAP, Google, DAI, etc.

mBanking1

The conference blog is available here and the Twitter feed I set up is here. The extensive work that went into organizing this international conference explains my relative absence from iRevolution; that and my three days off the grid in Lamu with Fletcher colleagues and Erik Hersman.

I have already blogged about mBanking here, so I thought I’d combine my interest in the subject with my ongoing work with Ushahidi.

One of the issues that keeps cropping up when discussing mBanking (and branchless banking) is the challenge of agent reliability and customer service. How does one ensure the trustworthiness of a growing network of agents and simultaneously handle customer complaints?

A number of speakers at Fletcher’s recent conference highlighted these challenges and warned they would become more pressing with time. So this got me thinking about an Ushahidi-for-mBanking platform.

Since mBanking customers by definition own a mobile phone, a service like M-Pesa or Zap could provide customers with a dedicated short code which they could use to text in concerns or report complaints along with location information. These messages could then be mapped in quasi real-time on an Ushahidi platform. This would provide companies like Safaricom and Zain with a crowdsourced approach to monitoring their growing agent network.

A basic spatial analysis of these customer reports over time would enable Safaricom and Zain to identify trends in customer complaints. The geo-referenced data could also provide the companies with a way to monitor agent-reliability by location. Safaricom could then offer incentives to M-Pesa agents to improve agent compliance and reward them accordingly.
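As a simple illustration of that kind of analysis, here is a hedged sketch with entirely hypothetical column names and records: it counts geo-referenced complaints per agent location and per month, which is roughly the trend information a provider would need to monitor agent reliability.

```python
import pandas as pd

# Hypothetical crowdsourced complaint reports (location, date received, complaint type)
reports = pd.DataFrame({
    "agent_location": ["Kibera", "Kibera", "Westlands", "Kibera", "Westlands"],
    "received":       ["2009-05-02", "2009-05-09", "2009-05-20", "2009-06-01", "2009-06-15"],
    "complaint":      ["no float", "overcharge", "closed early", "no float", "overcharge"],
})
reports["received"] = pd.to_datetime(reports["received"])
reports["month"] = reports["received"].dt.to_period("M")

# Complaints per agent location per month: where and when are service problems clustering?
trend = reports.groupby(["agent_location", "month"]).size().rename("complaints")
print(trend)
```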

In other words, the “balance of power” would shift from the agent to the customer since the latter would now be in position to report on quality of service.

But why wait for Safaricom and Zain to kick this off? Why not simply launch two public parallel platforms, one for M-Pesa and the other for Zap to determine which of the two companies receive more complaints and how quickly they respond to them?

To make the sites sustainable, one could easily come up with a number of business models. One idea might be to provide advertising space on the Ushahidi-mBanking site. In addition, the platform would provide a way to collect the mobile phone numbers of individual clients; this information could then be used to broadcast ads-by-SMS on a weekly basis, for example.

If successful, this approach could be replicated with Wizzit and MTN in South Africa and gCash in the Philippines. I wish I had several more weeks in Nairobi to spearhead this but I’m heading back to the Sudan to continue my consulting work with the UN’s Threat and Risk Mapping Analysis (TRMA).

Patrick Philippe Meier

Moving Forward with Swift River

This is an update on the latest Swift River open group meeting that took place this morning at the InSTEDD office in Palo Alto. Ushahidi colleague Kaushal Jhalla first proposed the idea behind Swift River after the terrorist attacks on Mumbai last November. Ushahidi has since taken on the initiative as a core project since the goal of Swift River is central to the group’s mission: the crowdsourcing of crisis information.

Kaushal and Chris Blow gave the first formal presentation of Swift River during our first Ushahidi strategy meeting in Orlando last March, where we formally established the Swift River group, which includes Andrew Turner, Sean Gourley, Erik Hersman and myself in addition to Kaushal and Chris. Andrew has played a pivotal role in getting Swift River and Vote Report India off the ground and I highly recommend reading his blog post on the initiative.

The group now includes several new friends of Ushahidi, a number of whom kindly shared their time and insights this morning after Chris kicked off the meeting to bring everyone up to speed.  The purpose of this blog post is to outline how I hope Swift River moves forward based on this morning’s fruitful session. Please see my previous blog post for an overview of the basic methodology.

The purpose of the Swift River platform, as I proposed this morning, is to provide two core services. The first, to borrow Gaurav Mishra‘s description, is to crowdsource the tagging of crisis information. The second is to triangulate the tagged information to assign reality scores to individual events. Confused? Not to worry, it’s actually really straightforward.

Crowdsourcing Tagging

Information on a developing crisis can be captured from several text-based sources, such as articles from online news media, Tweets and SMS. Of course, video footage, pictures and satellite imagery can also provide important information, but we’re more interested in text-based data for now.

The first point to note is that information can range from being very structured to highly unstructured. The word structure is simply another way of describing how organized information is. A few examples are in order vis-a-vis text-based information.

A book is generally highly structured information. Why? Well, because the author hopefully used page numbers, chapter headings, paragraphs, punctuation, an index and table of contents. The fact that the book is structured makes it easier for the reader to find the information she is looking for. The other end of the “structure spectrum” would be a run-on sentence with nospacesandpunctuation. Not terribly helpful.

Below is a slide from a seminar I taught on disaster and conflict early warning back in 2006; ignore the (c).

ewstructure

The slide above depicts the tradeoff between control and structure. We can impose structure on data collected if we control the data entry process. Surveys are an example of a high-control process that yields high-structure. We want high structure because this allows us to find and analyze the data more easily (c.f. entropy). This has generally been the preferred approach, particularly amongst academics.

If we give up control, as one does when crowdsourcing crisis information, we open ourselves up to the possibility of having to deal with a range of structured and unstructured information. To make sense of this information typically requires data mining and natural language processing (NLP) techniques that can identify structure in said information. For example, we would want to identify nouns, verbs, places and dates in order to extract event-data.

One way to do this would be to automatically tag an article with the parameters “who, what, where and when.” A number of platforms such as Open Calais and Virtual Research Associates’ FORECITE already do this. However, these platforms are not customized for crowdsourcing of crisis information and most are entirely closed. (Note: I did consulting work for VRA many years ago.)
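As a rough illustration of this kind of automated tagging, here is a sketch using the open-source spaCy library (one possible choice of tool, not what the platforms above use) to auto-suggest who/where/when tags from a hypothetical news sentence. It assumes the small English model has been downloaded beforehand.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Gunmen attacked a convoy near Goma on Tuesday, local officials said."
doc = nlp(text)

# Sort named entities into rough "who / where / when" buckets
suggested = {"who": [], "where": [], "when": []}
for ent in doc.ents:
    if ent.label_ in ("PERSON", "ORG", "NORP"):
        suggested["who"].append(ent.text)
    elif ent.label_ in ("GPE", "LOC", "FAC"):
        suggested["where"].append(ent.text)
    elif ent.label_ in ("DATE", "TIME"):
        suggested["when"].append(ent.text)

print(suggested)   # suggestions a human reviewer would then confirm or correct
```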

So we need to draw on (and modify) relevant algorithms that are publicly available and provide a user-friendly interface for human oversight of the automated tagging (what we also referred to as crowdsourcing the filter). Here’s a proposed interface that Chris recently designed for Swift River.

swiftriver

The idea would be to develop an algorithm that parses the text (on the left) and auto-suggests answers for the tags (on the right). The user would then confirm or correct the suggested tags and the algorithm would learn from its mistakes. In other words, the algorithm would become more accurate over time and the need for human oversight would decrease. In short, we’d be developing a data-driven ontology backed up by Freebase to provide semantic linkages.
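One simple way to implement “learning from its mistakes” would be an incremental classifier that is updated with every human confirmation or correction. The sketch below uses scikit-learn and entirely hypothetical tag categories and example texts; it illustrates the feedback-loop idea rather than the actual Swift River design.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CATEGORIES = ["violence", "displacement", "other"]   # hypothetical event tags
vectorizer = HashingVectorizer(n_features=2**16)
model = SGDClassifier()

def suggest(text):
    """Auto-suggest a tag for a new piece of text."""
    return model.predict(vectorizer.transform([text]))[0]

def correct(text, confirmed_label):
    """Every human confirmation or correction becomes one more training example."""
    model.partial_fit(vectorizer.transform([text]), [confirmed_label], classes=CATEGORIES)

correct("families fleeing shelling in the north", "displacement")   # initial examples
correct("armed clashes reported outside the town", "violence")
print(suggest("clashes force residents to flee"))
```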

VRA already does this, but (1) the data validation is carried out by one (poor) individual; (2) the articles were restricted to headlines from the Reuters and Agence France-Presse (AFP) newswires; and (3) the project did not draw on semantic analysis. The validation component entailed making sure that events described in the headlines were correctly coded by the parser and ensuring there were no duplicates. See VRA’s patent for the full methodology (PDF).

Triangulation and Scoring

The above tagging process would yield a highly structured event dataset like the example depicted below.

dataset

We could then use simple machine analysis to cluster the same events together and thereby do away with any duplicate event-data. The four records above would then be collapsed into one record:

datafilter2

But that’s not all. We would use a simple weighting or scoring schema to assign a reality score to determine the probability that the event reported really happened. I already described this schema in my previous post so will just give one example: An event that is reported by more than one source is more likely to have happened. This increases the reality score of the event above and pushes it higher up the list. One could also score an event by the geographical proximity of the source to the reported event, and so on. These scores could be combined to give an overall score.
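Here is a minimal sketch of how that triangulation and scoring step could be coded, with made-up reports and purely illustrative weights: duplicate records of the same event are clustered on (what, where, when), and the reality score rises with the number of independent sources and with sources close to the reported location.

```python
from collections import defaultdict

# Hypothetical tagged reports; "source_km_from_event" stands in for geographic proximity
reports = [
    {"what": "riot",  "where": "Nairobi", "when": "2009-01-05", "source": "Reuters", "source_km_from_event": 2},
    {"what": "riot",  "where": "Nairobi", "when": "2009-01-05", "source": "AFP",     "source_km_from_event": 350},
    {"what": "riot",  "where": "Nairobi", "when": "2009-01-05", "source": "SMS",     "source_km_from_event": 1},
    {"what": "flood", "where": "Kisumu",  "when": "2009-01-06", "source": "blog",    "source_km_from_event": 40},
]

# Cluster records that describe the same event (collapsing duplicates)
events = defaultdict(list)
for r in reports:
    events[(r["what"], r["where"], r["when"])].append(r)

for key, group in events.items():
    n_sources = len({r["source"] for r in group})             # independent sources
    nearby    = sum(r["source_km_from_event"] < 50 for r in group)
    score = min(1.0, 0.3 * n_sources + 0.1 * nearby)           # illustrative weighting only
    print(key, f"reality score: {score:.1f}")
```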

Compelling Visualization

The database output above is not exactly compelling to most people. This is where we need some creative visualization techniques to render the information more intuitive and interesting. Here are a few thoughts. We could draw on Gapminder to visualize the triangulated event-data over time. We could also use the idea of a volume equalizer display.

equalize

This is not the best equalizer interface around for sure, but hopefully gets the point across. Instead of decibels on the Y-axis, we’d have probability scores that an event really happened. Instead of frequencies on the X-axis, we’d have the individual events. Since the data coming in is not static, the bars would bounce up and down as more articles/tweets get tagged and dumped into the event database.

I think this would be an elegant way to visualize the data, not least because the animation would resemble the flow or waves of a swift river; the volume equalizer also works as an analogy for quieting unwanted noise. For the actual Swift River interface, I’d prefer using more colors to denote different characteristics of the event and would provide the user with the option of double-clicking on a bar to drill down to the event sources and underlying text.

Patrick Philippe Meier