Automatically Extracting Disaster-Relevant Information from Social Media

Latest update on AIDR available here

My team and I at QCRI have just had this paper (PDF) accepted at the Social Web for Disaster Management workshop at the World Wide Web (WWW 2013) conference in Rio next month. The paper relates directly to our Artificial Intelligence for Disaster Response (AIDR) project. One of our main missions at QCRI is to develop open source and freely available next generation humanitarian technologies to better manage Big (Crisis) Data. Over 20 million tweets and half-a-million Instagram pictures were posted during Hurricane Sandy, for example. In Japan, more than 2,000 tweets were posted every second the day after the devastating earthquake and tsunami struck the Eastern Coast. Recent empirical studies have shown that an important percentage of tweets posted during disasters are informative and even actionable. The challenge before us is how to find those proverbial needles in the haystack and how to do so in as close to real-time as possible.

So we analyzed disaster tweets posted during Hurricane Sandy (2012) and the Joplin Tornado (2011). We demonstrate that disaster-relevant information can be automatically extracted from these datasets. The results indicate that 40% to 80% of tweets that contain disaster-related information can be automatically detected. We also demonstrate that we can correctly identify the type of disaster information 80% to 90% of the time. This means, for example, that once we identify a disaster tweet, we can automatically and correctly determine whether that tweet was written by an eyewitness 80% to 90% of the time. Because these classifiers are developed using machine learning, they get more accurate with more data. This explains why we are building AIDR. Our aim is not to replace human involvement and oversight but to take much of the weight off the shoulders of humans.

The classifiers we’ve developed automatically identify tweets that are personal in nature and those that are informative, that is, tweets that are of interest to others beyond the author’s immediate circle. We also created classifiers to differentiate between informative content shared by eyewitnesses versus content that is simply recycled by other sources such as the media. What’s more, we also created classifiers to distinguish between various types of informative content. In addition to classifying, we extract key phrases from each tweet. A key phrase summarizes the essential message of a tweet in a few words, allowing for better visualization/aggregation of content. Below, we list real-world examples of tweets in each class. The underlined text is what the extraction system finds to be the key phrase of each tweet:

Caution and Advice: message conveys/reports information about some warning or a piece of advice about a possible hazard.

  • .@NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy

Casualties and Damage: message mentions casualties or infrastructure damage related to the disaster.

  • At least 39 dead; millions without power in Sandy’s aftermath. http//[Link].

Donations and Offers: message speaks about goods or services offered or needed by the victims of an incident.

  • 400 Volunteers are needed for areas that #Sandy destroyed.
  • I want to volunteer to help the hurricane Sandy victims. If anyone knows how I can get involved please let me know!

People Missing, Found, or Seen: message reports about a missing or found person affected by an incident, or reports the reaction or visit of a celebrity.

  • rt @911buff: public help needed: 2 boys 2 & 4 missing nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff

Information Sources: message points to information sources, photos, videos; or mentions a website, TV or radio station providing extensive coverage.

  • RT @NBCNewsPictures: Photos of the unbelievable scenes left in #Hurricane #Sandy’s wake http//[Link] #NYC #NJ

The two metrics used to assess the results of our analysis are “Detection Rate” and “Hit Ratio”. The best way to explain these metrics is by way of analogy. The Detection Rate measures how good your fishing net is. If you know (thanks to sonar) that there are 10 fish in the pond and your net is good enough to catch all 10, then your Detection Rate is 100%. If you catch 8 out of 10, your rate is 80%. In other words, the Detection Rate is a measure of sensitivity. Now say you’ve designed the world’s first ever “Smart Net” which only catches salmon and thus leaves all other fish in the same pond alone. Now say you caught 5 fish and that you wanted salmon. If all 5 are salmon, your Hit Ratio is 100%. If only 2 of them are salmon, then your Hit Ratio is 40%. In other words, Hit Ratio is a measure of precision: how much of what you caught is what you actually wanted.
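To make the fishing analogy concrete, here is a minimal Python sketch of the two ratios. The function names and arguments are my own illustrative choices, not code from the paper or from AIDR; the numbers are the ones used in the analogy above.

```python
def detection_rate(num_caught: int, num_in_pond: int) -> float:
    """Share of the items that exist which the net actually caught."""
    if num_in_pond <= 0:
        raise ValueError("num_in_pond must be positive")
    return num_caught / num_in_pond


def hit_ratio(num_wanted_caught: int, num_caught: int) -> float:
    """Share of the catch that is the kind of item we wanted."""
    if num_caught <= 0:
        raise ValueError("num_caught must be positive")
    return num_wanted_caught / num_caught


# Figures from the analogy: 8 of 10 fish caught, 2 of 5 caught fish are salmon.
print(f"Detection Rate: {detection_rate(8, 10):.0%}")  # Detection Rate: 80%
print(f"Hit Ratio: {hit_ratio(2, 5):.0%}")              # Hit Ratio: 40%
```

The two numbers answer different questions, which is why a system can score well on one and poorly on the other.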

Turning to our results, the Detection Rate was higher for Joplin (78%) than for Sandy (41%). The Hit Ratio is also higher for Joplin (90%) than for Sandy (78%). In other words, our classifiers find the Sandy dataset more challenging to decode. That said, the Hit Ratio is rather high in both cases, indicating that when our system extracts some part of the tweet, it is often the correct part. In sum, our approach can detect from 40% to 80% of the tweets containing disaster-related information and can correctly identify the specific type of disaster information 80% to 90% of the time. This means, for example, that once we identify a disaster tweet, we can automatically and correctly determine whether that tweet was written by an eyewitness between 80% and 90% of the time. Because these classifiers are developed using machine learning, they get more accurate with more data. This explains why we are building AIDR. Our aim is not to replace human involvement and oversight but to significantly lessen the load on humans.

This tweet-level extraction is key to extracting more reliable high-level information. For instance, if a large number of tweets from similar locations report the same infrastructure as being damaged, that may be a strong indicator that this is indeed the case. So we are very much continuing our research and working hard to increase both Detection Rates and Hit Ratios.
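As a rough illustration of that aggregation idea, the sketch below counts how many tweets report the same damaged infrastructure in the same place and keeps only the reports that several tweets corroborate. Everything here is an assumption made for illustration: the (location, infrastructure) pairs are made-up placeholders, and the function name and the threshold of 3 are hypothetical, not part of our system.

```python
from collections import Counter

# Hypothetical tweet-level extractions: (location, damaged infrastructure).
# These placeholder values are invented for illustration only.
extractions = [
    ("area A", "bridge"),
    ("area A", "bridge"),
    ("area B", "power line"),
    ("area A", "bridge"),
    ("area C", "road"),
]


def corroborated_reports(pairs, min_reports=3):
    """Keep only (location, infrastructure) pairs reported at least min_reports times."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_reports}


print(corroborated_reports(extractions))
```

A simple count like this treats every tweet as equally credible; retweets and copied media reports would need to be handled separately before the counts mean much.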

18 responses to “Automatically Extracting Disaster-Relevant Information from Social Media”

  1. Hi Patrick. Today, I came across this topical article in the Harvard Business Review:
    http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html
    The author points to two questions that need to be asked when considering the veracity of ‘Big Data’: 1. Who was excluded? 2. Which places are less visible? The author also notes that biases in the data are moving targets.
    What can we do at the operational level to correct for these biases?

  8. Hi Patrick, in the light of the American Red Cross report on social media usage and expectations in emergencies, how do you see that an always imperfect artificial algorithm can be used to identify specific cries for help? Or does an emergency service need to monitor the 20M tweets manually anyway to detect these alerts?
