This is an update on the latest Swift River open group meeting that took place this morning at the InSTEDD office in Palo Alto. Ushahidi colleague Kaushal Jhalla first proposed the idea behind Swift River after the terrorist attacks on Mumbai last November. Ushahidi has since taken on the initiative as a core project because the goal of Swift River is central to the group’s mission: the crowdsourcing of crisis information.
Kaushal and Chris Blow gave the first formal presentation of Swift River during our first Ushahidi strategy meeting in Orlando last March, where we formally established the Swift River group, which includes Andrew Turner, Sean Gourley, Erik Hersman and myself in addition to Kaushal and Chris. Andrew has played a pivotal role in getting Swift River and Vote Report India off the ground and I highly recommend reading his blog post on the initiative.
The group now includes several new friends of Ushahidi, a number of whom kindly shared their time and insights this morning after Chris kicked off the meeting to bring everyone up to speed. The purpose of this blog post is to outline how I hope Swift River moves forward based on this morning’s fruitful session. Please see my previous blog post for an overview of the basic methodology.
The purpose of the Swift River platform, as I proposed this morning, is to provide two core services. The first, to borrow Gaurav Mishra’s description, is to crowdsource the tagging of crisis information. The second is to triangulate the tagged information to assign reality scores to individual events. Confused? Not to worry, it’s actually really straightforward.
Information on a developing crisis can be captured from several text-based sources such as articles from online news media, Tweets and SMS. Of course, video footage, pictures and satellite imagery can also provide important information, but we’re more interested in text-based data for now.
The first point to note is that information can range from being very structured to highly unstructured. The word structure is simply another way of describing how organized information is. A few examples are in order vis-a-vis text-based information.
A book is generally highly structured information. Why? Well, because the author hopefully used page numbers, chapter headings, paragraphs, punctuation, an index and table of contents. The fact that the book is structured makes it easier for the reader to find the information she is looking for. The other end of the “structure spectrum” would be a run-on sentence with nospacesandpunctuation. Not terribly helpful.
Below is a slide from a seminar I taught on disaster and conflict early warning back in 2006; ignore the (c).
The slide above depicts the tradeoff between control and structure. We can impose structure on the data collected if we control the data entry process. Surveys are an example of a high-control process that yields high structure. We want high structure because this allows us to find and analyze the data more easily (cf. entropy). This has generally been the preferred approach, particularly amongst academics.
If we give up control, as one does when crowdsourcing crisis information, we open ourselves up to the possibility of having to deal with a range of structured and unstructured information. Making sense of this information typically requires data mining and natural language processing (NLP) techniques that can identify structure in said information. For example, we would want to identify nouns, verbs, places and dates in order to extract event-data.
One way to do this would be to automatically tag an article with the parameters “who, what, where and when.” A number of platforms such as Open Calais and Virtual Research Associates’ FORECITE already do this. However, these platforms are not customized for the crowdsourcing of crisis information and most are entirely closed. (Note: I did consulting work for VRA many years ago.)
So we need to draw on (and modify) relevant algorithms that are publicly available and provide a user-friendly interface for human oversight of the automated tagging (what we also referred to as crowdsourcing the filter). Here’s a proposed interface that Chris recently designed for Swift River.
The idea would be to develop an algorithm that parses the text (on the left) and auto-suggests answers for the tags (on the right). The user would then confirm or correct the suggested tags and the algorithm would learn from its mistakes. In other words, the algorithm would become more accurate over time and the need for human oversight would decrease. In short, we’d be developing a data-driven ontology backed up by Freebase to provide semantic linkages.
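To make the suggest, confirm and learn loop a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the Swift River design: the `TagSuggester` class, the date pattern and the simple "remember every confirmed term" learning rule are all hypothetical stand-ins for a real NLP parser.

```python
import re
from dataclasses import dataclass, field

# Hypothetical date pattern: only matches forms like "26 November 2008".
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


@dataclass
class TagSuggester:
    # Known terms per tag; this vocabulary grows as users correct suggestions.
    known: dict = field(
        default_factory=lambda: {"who": set(), "what": set(), "where": set()}
    )

    def suggest(self, text):
        """Auto-suggest who/what/where/when tags for a piece of text."""
        tags = {}
        lowered = text.lower()
        for tag, terms in self.known.items():
            # Suggest the known term that appears earliest in the text.
            hits = [(lowered.find(t.lower()), t) for t in terms if t.lower() in lowered]
            tags[tag] = min(hits)[1] if hits else None
        match = DATE_RE.search(text)
        tags["when"] = match.group(0) if match else None
        return tags

    def confirm(self, suggested, corrected):
        """Store the user's corrections and return the tags that were wrong."""
        wrong = []
        for tag, value in corrected.items():
            if suggested.get(tag) != value:
                wrong.append(tag)
            if tag in self.known and value:
                self.known[tag].add(value)
        return wrong


suggester = TagSuggester()
first = suggester.suggest("Gunmen attacked a hotel in Mumbai on 26 November 2008")
corrections = {"who": "Gunmen", "what": "attacked",
               "where": "Mumbai", "when": "26 November 2008"}
wrong_tags = suggester.confirm(first, corrections)
second = suggester.suggest("Gunmen attacked a station in Mumbai on 27 November 2008")
```

On the first article the sketch can only guess the date, so the user has to supply the other three tags. On the second article it suggests all four, which is the sense in which the need for human oversight would decrease.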
VRA already does this, but (1) the data validation is carried out by one (poor) individual, (2) the articles were restricted to the headlines from Reuters and Agence France-Presse (AFP) newswires, and (3) the project did not draw on semantic analysis. The validation component entailed making sure that events described in the headlines were correctly coded by the parser and ensuring there were no duplicates. See VRA’s patent for the full methodology (PDF).
Triangulation and Scoring
The above tagging process would yield a highly structured event dataset like the example depicted below.
We could then use simple machine analysis to cluster the same events together and thereby do away with any duplicate event-data. The four records above would then be collapsed into one record:
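As a rough illustration of that collapsing step, here is a Python sketch that groups records sharing the same "what, where and when" and merges their sources. The field names, the exact-match rule and the four sample records are made up for illustration; real clustering would need fuzzier matching on places, dates and wording.

```python
from collections import defaultdict


def collapse_events(records):
    """Group records sharing (what, where, when) and merge their sources."""
    groups = defaultdict(list)
    for record in records:
        # Normalize case and whitespace so trivially different spellings match.
        key = tuple(record[k].strip().lower() for k in ("what", "where", "when"))
        groups[key].append(record)

    merged = []
    for (what, where, when), group in groups.items():
        merged.append({
            "what": what,
            "where": where,
            "when": when,
            # Keep every distinct source so they can be counted later.
            "sources": sorted({r["source"] for r in group}),
        })
    return merged


# Hypothetical sample data: four reports of the same event.
records = [
    {"what": "Riot", "where": "Nairobi", "when": "2009-04-20", "source": "news"},
    {"what": "riot", "where": "Nairobi ", "when": "2009-04-20", "source": "twitter"},
    {"what": "Riot", "where": "nairobi", "when": "2009-04-20", "source": "sms"},
    {"what": "riot", "where": "Nairobi", "when": "2009-04-20", "source": "blog"},
]
collapsed = collapse_events(records)
```

Keeping the list of sources on the merged record matters because the number of independent sources feeds into the scoring described next.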
But that’s not all. We would use a simple weighting or scoring schema to assign a reality score to determine the probability that the event reported really happened. I already described this schema in my previous post so will just give one example: An event that is reported by more than one source is more likely to have happened. This increases the reality score of the event above and pushes it higher up the list. One could also score an event by the geographical proximity of the source to the reported event, and so on. These scores could be combined to give an overall score.
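Here is one way such a combined score could be sketched in Python. The formulas, weights and the 100 km distance scale are assumptions I picked for illustration, not the schema from my previous post; the only properties carried over are that more sources and closer sources both raise the score.

```python
def source_score(n_sources):
    """Rises toward 1 as more sources report the event (assumed formula)."""
    return 1.0 - 0.5 ** n_sources


def proximity_score(distance_km, scale_km=100.0):
    """Equals 1 at the event location and decays with distance (assumed formula)."""
    return 1.0 / (1.0 + distance_km / scale_km)


def reality_score(n_sources, nearest_source_km, w_sources=0.6, w_proximity=0.4):
    """Weighted average of the two component scores; weights are hypothetical."""
    return (w_sources * source_score(n_sources)
            + w_proximity * proximity_score(nearest_source_km))


single_report = reality_score(n_sources=1, nearest_source_km=0.0)
four_reports = reality_score(n_sources=4, nearest_source_km=0.0)
distant_report = reality_score(n_sources=1, nearest_source_km=400.0)
```

With these assumed weights, an event reported by four sources scores higher than the same event reported by one, and a report from a source 400 km away scores lower than one from a source on the spot.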
The database output above is not exactly compelling to most people. This is where we need some creative visualization techniques to render the information more intuitive and interesting. Here are a few thoughts. We could draw on Gapminder to visualize the triangulated event-data over time. We could also use the idea of a volume equalizer display.
This is not the best equalizer interface around for sure, but hopefully gets the point across. Instead of decibels on the Y-axis, we’d have probability scores that an event really happened. Instead of frequencies on the X-axis, we’d have the individual events. Since the data coming in is not static, the bars would bounce up and down as more articles/tweets get tagged and dumped into the event database.
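For what it’s worth, the equalizer idea can be mocked up in a few lines of Python as a text display. This is just a toy sketch with a hypothetical `render_equalizer` function: each column is one event and the bar height is its probability score; a real interface would redraw the bars as new data arrives.

```python
def render_equalizer(scores, height=5):
    """Return text rows, top to bottom, with one column per event."""
    # Convert each probability score (0 to 1) into a bar height in rows.
    levels = [round(score * height) for score in scores]
    rows = []
    for level in range(height, 0, -1):
        rows.append(" ".join("#" if bar >= level else "." for bar in levels))
    return rows


# Hypothetical scores for three events.
rows = render_equalizer([1.0, 0.4, 0.0])
print("\n".join(rows))
```

Calling the function again with updated scores and reprinting would give the bouncing effect as more articles and tweets get tagged.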
I think this would be an elegant way to visualize the data, not least because the animation would resemble the flow or waves of a swift river. The volume equalizer also works as an analogy for quieting the unwanted noise. For the actual Swift River interface, I’d prefer using more colors to denote different characteristics of the event and would provide the user with the option of double-clicking on a bar to drill down to the event sources and underlying text.