Automatically Extracting Disaster-Relevant Information from Social Media

Latest update on AIDR available here

My team and I at QCRI have just had this paper (PDF) accepted at the Social Web for Disaster Management workshop at the World Wide Web (WWW 2013) conference in Rio next month. The paper relates directly to our Artificial Intelligence for Disaster Response (AIDR) project. One of our main missions at QCRI is to develop open source and freely available next generation humanitarian technologies to better manage Big (Crisis) Data. Over 20 million tweets and half-a-million Instagram pictures were posted during Hurricane Sandy, for example. In Japan, more than 2,000 tweets were posted every second the day after the devastating earthquake and tsunami struck the eastern coast. Recent empirical studies have shown that an important percentage of tweets posted during disasters are informative and even actionable. The challenge before us is how to find those proverbial needles in the haystack and how to do so in as close to real-time as possible.

So we analyzed disaster tweets posted during Hurricane Sandy (2012) and the Joplin Tornado (2011). We demonstrate that disaster-relevant information can be automatically extracted from these datasets. The results indicate that 40% to 80% of tweets that contain disaster-related information can be automatically detected. We also demonstrate that we can correctly identify the type of disaster information 80% to 90% of the time. This means, for example, that once we identify a disaster tweet, we can automatically and correctly determine whether that tweet was written by an eyewitness 80% to 90% of the time. Because these classifiers are developed using machine learning, they get more accurate with more data. This explains why we are building AIDR. Our aim is not to replace human involvement and oversight but to take much of the weight off the shoulders of humans.

The classifiers we’ve developed automatically distinguish tweets that are personal in nature from those that are informative, that is, tweets that are of interest to others beyond the author’s immediate circle. We also created classifiers to differentiate between informative content shared by eyewitnesses and content that is simply recycled from other sources such as the media. What’s more, we created classifiers to distinguish between various types of informative content. In addition to classifying tweets, we extract key phrases from each one. A key phrase summarizes the essential message of a tweet in a few words, allowing for better visualization and aggregation of content. Below, we list real-world examples of tweets in each class. The underlined text is what the extraction system finds to be the key phrase of each tweet:

Caution and Advice: message conveys/reports information about some warning or a piece of advice about a possible hazard.

.@NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy

Casualties and Damage: message mentions casualties or infrastructure damage related to the disaster.

At least 39 dead; millions without power in Sandy’s aftermath. http//[Link].

Donations and Offers: message speaks about goods or services offered or needed by the victims of an incident.

400 Volunteers are needed for areas that #Sandy destroyed.
I want to volunteer to help the hurricane Sandy victims. If anyone knows how I can get involved please let me know!

People Missing, Found, or Seen: message reports a missing or found person affected by an incident, or reports the reaction or visit of a celebrity.

rt @911buff: public help needed: 2 boys 2 & 4 missing nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff

Information Sources: message points to information sources, photos, videos; or mentions a website, TV or radio station providing extensive coverage.

RT @NBCNewsPictures: Photos of the unbelievable scenes left in #Hurricane #Sandy’s wake http//[Link] #NYC #NJ

The two metrics used to assess the results of our analysis are “Detection Rate” and “Hit Ratio”. The best way to explain these metrics is by way of analogy. The Detection Rate measures how good your fishing net is. If you know (thanks to sonar) that there are 10 fish in the pond and your net is good enough to catch all 10, then your Detection Rate is 100%. If you catch 8 out of 10, your rate is 80%. In other words, the Detection Rate is a measure of sensitivity. Now say you’ve designed the world’s first ever “Smart Net”, which only catches salmon and thus leaves all other fish in the same pond alone. Say you caught 5 fish and that you wanted salmon. If all 5 are salmon, your Hit Ratio is 100%. If only 2 of them are salmon, then your Hit Ratio is 40%. In other words, the Hit Ratio is a measure of precision: how accurate your catch is.
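
To make the analogy concrete, here is a minimal Python sketch of the two metrics as the fishing example describes them. The function and parameter names are my own for illustration and are not taken from the paper, which may compute these figures differently.

```python
def detection_rate(relevant_caught: int, relevant_total: int) -> float:
    """Fraction of all relevant items that were caught (sensitivity)."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return relevant_caught / relevant_total


def hit_ratio(relevant_caught: int, total_caught: int) -> float:
    """Fraction of the caught items that were actually relevant."""
    if total_caught <= 0:
        raise ValueError("total_caught must be positive")
    return relevant_caught / total_caught


# The fishing examples from the text.
print(f"{detection_rate(8, 10):.0%}")  # 8 of the 10 fish in the pond were caught
print(f"{hit_ratio(2, 5):.0%}")        # 2 of the 5 fish caught were salmon
```

The two numbers answer different questions: the first asks how much of what is out there you found, the second asks how much of what you found is what you wanted.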

Turning to our results, the Detection Rate was higher for Joplin (78%) than for Sandy (41%). The Hit Ratio was also higher for Joplin (90%) than for Sandy (78%). In other words, our classifiers find the Sandy dataset more challenging to decode. That said, the Hit Ratio is rather high in both cases, indicating that when our system extracts some part of the tweet, it is often the correct part. In sum, our approach can detect from 40% to 80% of the tweets containing disaster-related information and can correctly identify the specific type of disaster information 80% to 90% of the time. This means, for example, that once we identify a disaster tweet, we can automatically and correctly determine whether that tweet was written by an eyewitness between 80% and 90% of the time. Because these classifiers are developed using machine learning, they get more accurate with more data. This explains why we are building AIDR. Our aim is not to replace human involvement and oversight but to significantly lessen the load on humans.

This tweet-level extraction is key to extracting more reliable high-level information. Observing, for instance, that a large number of tweets in similar locations report the same infrastructure as being damaged may be a strong indicator that this is indeed the case. So we are very much continuing our research and working hard to increase both Detection Rates and Hit Ratios.
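
As a rough illustration of that aggregation idea, the sketch below counts how many extracted reports mention the same infrastructure in the same location and keeps only the combinations that several tweets agree on. Everything here is an assumption made for the example: the record fields, the `min_reports` threshold, and the placeholder data are invented, and this is not how AIDR itself is implemented.

```python
from collections import Counter


def corroborated_reports(reports, min_reports=3):
    """Return (location, infrastructure) pairs reported at least min_reports times."""
    counts = Counter((r["location"], r["infrastructure"]) for r in reports)
    return {key: n for key, n in counts.items() if n >= min_reports}


# Placeholder records standing in for damage reports extracted from tweets.
reports = [
    {"location": "area_a", "infrastructure": "bridge"},
    {"location": "area_a", "infrastructure": "bridge"},
    {"location": "area_a", "infrastructure": "bridge"},
    {"location": "area_a", "infrastructure": "power line"},
    {"location": "area_b", "infrastructure": "bridge"},
]

corroborated = corroborated_reports(reports, min_reports=3)
print(corroborated)
```

A simple count like this treats every tweet as an independent witness, which retweets would violate, so a real system would likely need to deduplicate first.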


Patrick Meier, PhD
