Monthly Archives: November 2012

Why USAID’s Crisis Map of Syria is so Unique

While static, this crisis map includes a unique detail. Click on the map below to see a larger version, as this may help you spot what is so striking.

For a hint, click this link. Still stumped? Look at the sources listed in the Key.

 

Using E-Mail Data to Estimate International Migration Rates

As is well known, “estimates of demographic flows are inexistent, outdated, or largely inconsistent, for most countries.” I would add costly to that list as well. So my QCRI colleague Ingmar Weber co-authored a very interesting study on the use of e-mail data to estimate international migration rates.

The study analyzes a large sample of Yahoo! emails sent by 43 million users between September 2009 and June 2011. “For each message, we know the date when it was sent and the geographic location from where it was sent. In addition, we could link the message with the person who sent it, and with the user’s demographic information (date of birth and gender), that was self reported when he or she signed up for a Yahoo! account. We estimated the geographic location from where each email message was sent using the IP address of the user.”
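To make the method concrete: a user's country of residence in a given period can be taken as the country from which most of their messages were sent, and a change of residence across periods counts as a migration. This is only a minimal sketch of the idea (the authors' actual definitions are more involved, and all names below are illustrative):

```python
from collections import Counter

def residence(messages):
    # messages: list of (timestamp, country) pairs, where the country
    # comes from IP-based geolocation of each sent e-mail.
    return Counter(country for _, country in messages).most_common(1)[0][0]

def migrated(messages_period_1, messages_period_2):
    # Flag a user as a migrant if the modal country of their messages
    # changes between the two observation periods.
    return residence(messages_period_1) != residence(messages_period_2)
```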

The authors used data on existing migration rates for a dozen countries and international statistics on Internet diffusion rates by age and gender in order to correct for selection bias. For example, “estimated number of migrants, by age group and gender, is multiplied by a correction factor to adjust for over-representation of more educated and mobile people in groups for which the Internet penetration is low.” The graphs below are estimates of age and gender-specific immigration rates for the Philippines. “The gray area represents the size of the bias correction.” This means that “without any correction for bias, the point estimates would be at the upper end of the gray area.” These methods “correct for the fact that the group of users in the sample, although very large, is not representative of the entire population.”
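This post does not reproduce the paper's exact formula, but the logic of the correction can be sketched as a penetration-weighted mixture: the population-wide rate blends the rate observed among Internet users with an assumed (lower) rate for offline people, so low-penetration groups get the largest downward adjustment. The `offline_ratio` parameter below is purely illustrative:

```python
def corrected_rate(observed_rate, penetration, offline_ratio=0.5):
    # observed_rate: migration rate among e-mail users in one
    # age/gender group; penetration: share of that group online.
    # Assumes (illustratively) that offline people migrate at
    # offline_ratio times the rate of online people, so the population
    # rate is a penetration-weighted mixture of the two.
    correction_factor = penetration + (1 - penetration) * offline_ratio
    return observed_rate * correction_factor

# Illustrative: a 4% observed rate in a group with 25% Internet
# penetration shrinks to 4% * (0.25 + 0.75 * 0.5) = 2.5%.
print(corrected_rate(0.04, 0.25))  # -> 0.025
```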

The results? Ingmar and his co-author Emilio Zagheni were able to “estimate migration rates that are consistent with the ones published by those few countries that compile migration statistics. By using the same method for all geographic regions, we obtained country statistics in a consistent way, and we generated new information for those countries that do not have registration systems in place (e.g., developing countries), or that do not collect data on out-migration (e.g., the United States).” Overall, the study documented a “global trend of increasing mobility,” which is “growing at a faster pace for females than males. The rate of increase for different age groups varies across countries.”

The authors argue that this approach could also be used in the context of “natural” disasters and man-made disasters. In terms of future research, they are interested in evaluating “whether sending a high proportion of e-mail messages to a particular country (which is a proxy for having a strong social network in the country) is related to the decision of actually moving to the country.” Naturally, they are also interested in analyzing Twitter data. “In addition to mobility or migration rates, we could evaluate sentiments pro or against migration for different geographic areas. This would help us understand how sentiments change near an international border or in regions with different migration rates and economic conditions.”
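The network proxy they mention is easy to picture: for each user, compute the share of sent messages going to recipients in each country. A toy sketch (data layout hypothetical):

```python
from collections import Counter

def destination_shares(recipient_countries):
    # recipient_countries: one country code per e-mail a user sent,
    # derived from the recipient's location.
    counts = Counter(recipient_countries)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}

# A user sending 60% of their mail to the Philippines plausibly has
# strong social ties there:
print(destination_shares(["PH", "PH", "PH", "US", "QA"]))
```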

I’m very excited to have Ingmar at QCRI so we can explore these ideas further in the context of humanitarian and development challenges. I’ve been discussing similar research ideas with my colleagues at UN Global Pulse and there may be a real sweet spot for collaboration here, particularly with the recently launched Pulse Lab in Jakarta. The possibility of collaborating with my colleagues at Flowminder could also be really interesting given their important study of population movement following the Haiti Earthquake. In conclusion, I fully share the authors’ sentiment when they highlight that it is “more and more important to develop models for data sharing between private companies and the academic world, that allow for both protection of users’ privacy & private companies’ interests, as well as reproducibility in scientific publishing.”

Rapidly Verifying the Credibility of Information Sources on Twitter

One of the advantages of working at QCRI is that I’m regularly exposed to peer-reviewed papers presented at top computing conferences. This is how I came across an initiative called “Seriously Rapid Source Review” or SRSR. As many iRevolution readers know, I’m very interested in information forensics as applied to crisis situations. So SRSR certainly caught my attention.

The team behind SRSR took a human-centered design approach to integrate journalistic practices within the platform. There are four features worth noting in this respect. The first feature to note in the figure below is the automated filter function, which allows one to view tweets generated by “Ordinary People,” “Journalists/Bloggers,” “Organizations,” “Eyewitnesses” and “Uncategorized.” The second feature, Location, “shows a set of pie charts indicating the top three locations where the user’s Twitter contacts are located. This cue provides more location information and indicates whether the source has a ‘tie’ or other personal interest in the location of the event, an aspect of sourcing exposed through our preliminary interviews and suggested by related work.”
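The paper does not detail the implementation, but the Location cue boils down to counting a user's contacts per location and keeping the top three shares. A minimal sketch:

```python
from collections import Counter

def top_contact_locations(contact_locations, k=3):
    # contact_locations: one (self-reported) location per Twitter
    # contact; returns the k most common locations with their shares,
    # the data behind SRSR-style pie charts.
    counts = Counter(contact_locations)
    total = sum(counts.values())
    return [(loc, n / total) for loc, n in counts.most_common(k)]

# Half of this user's contacts are in the disaster area:
print(top_contact_locations(["NYC", "NYC", "Boston", "London"]))
```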


The third feature worth noting is the “Eyewitness” icon. The SRSR team developed the first-ever automatic classifier to identify eyewitness reports shared on Twitter. My team and I at QCRI are developing a second one that focuses specifically on automatically classifying eyewitness reports during sudden-onset natural disasters. The fourth feature is “Entities,” which displays the top five entities that the user has mentioned in their tweet history. These include references to organizations, people and places, which can reveal important patterns about the Twitter user in question.
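SRSR's actual eyewitness classifier is trained on labeled tweets; as a toy stand-in that conveys the intuition, first-person and on-the-ground language is the kind of signal such a classifier learns to pick up (the cue list below is invented for illustration):

```python
EYEWITNESS_CUES = ("i saw", "i see", "i heard", "we felt", "just felt",
                   "outside my", "my street", "my house")

def looks_like_eyewitness(tweet_text):
    # Crude keyword heuristic; a real classifier learns these and many
    # other features from labeled training examples.
    text = tweet_text.lower()
    return any(cue in text for cue in EYEWITNESS_CUES)

print(looks_like_eyewitness("Just felt the whole building shake!"))  # True
```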

Journalists participating in this applied research found the “Location” feature particularly important when assessing the credibility of users on Twitter. They noted that “sources that had friends in the location of the event were more believable, indicating that showing friends’ locations can be an indicator of credibility.” One journalist shared the following: “I think if it’s someone without any friends in the region that they’re tweeting about then that’s not nearly as authoritative, whereas if I find somebody who has 50% of friends are in [the disaster area], I would immediately look at that.”

In addition, the automatic identification of “eyewitnesses” was deemed essential by journalists who participated in the SRSR study. This should not be surprising since “news organizations often use eyewitnesses to add credibility to reports by virtue of the correspondent’s on-site proximity to the event.” Indeed, “Witnessing and reporting on what the journalist had witnessed have long been seen as quintessential acts of journalism.” To this end, “social media provides a platform where once passive witnesses can become active and share their eyewitness testimony with the world, including with journalists who may choose to amplify their report.”

In sum, SRSR could be used to accelerate the verification of social media content, i.e., go beyond source verification alone. For more on SRSR, please see this computing paper (PDF), which was authored by Nicholas Diakopoulos, Munmun De Choudhury and Mor Naaman.

The Most Impressive Live Global Twitter Map, Ever?

My colleague Kalev Leetaru has just launched The Global Twitter Heartbeat Project in partnership with the Cyber Infrastructure and Geospatial Information Laboratory (CIGI) and GNIP. He shared more information on this impressive initiative with the CrisisMappers Network this morning.

According to Kalev, the project “uses an SGI supercomputer to visualize the Twitter Decahose live, applying fulltext geocoding to bring the number of geolocated tweets from 1% to 25% (using a full disambiguating geocoder that uses all of the user’s available information in the Twitter stream, not just looking for mentions of major cities), tone-coding each tweet using a twitter-customized dictionary of 30,000 terms, and applying a brand-new four-stage heatmap engine (this is where the supercomputer comes in) that makes a map of the number of tweets from or about each location on earth, a second map of the average tone of all tweets for each location, a third analysis of spatial proximity (how close tweets are in an area), and a fourth map as needed for the percent of all of those tweets about a particular topic, which are then all brought together into a single heatmap that takes all of these factors into account, rather than a sequence of multiple maps.”
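Kalev's four-stage engine is obviously far more sophisticated, but the core bookkeeping behind the first two maps (tweet counts and average tone per location) reduces to binning geocoded tweets into grid cells. A rough sketch, with the cell size and data layout assumed for illustration:

```python
from collections import defaultdict

CELL_SIZE = 0.5  # grid resolution in degrees; the real engine is finer

def cell(lat, lon):
    # Snap a coordinate to its containing grid cell.
    return (round(lat / CELL_SIZE), round(lon / CELL_SIZE))

counts = defaultdict(int)
tone_sums = defaultdict(float)

def add_tweet(lat, lon, tone):
    # tone: sentiment score from a dictionary-based coder, e.g. -1..1.
    c = cell(lat, lon)
    counts[c] += 1
    tone_sums[c] += tone

def average_tone(c):
    # Second map: mean tone of all tweets seen in a cell.
    return tone_sums[c] / counts[c] if counts[c] else 0.0
```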

Kalev added that, “For the purposes of this demonstration we are processing English only, but are seeing a nearly identical spatial profile to geotagged all-languages tweets (though this will affect the tonal results).” The Twitterbeat team is running a live demo showing both a US and a world map updated in real time at Supercomputing on a PufferSphere and every few seconds on the SGI website here.


So why did Kalev share all this with the CrisisMappers Network? Because he and his team created a unique crisis map composed of all tweets about Hurricane Sandy; see the YouTube video above. “[Y]ou can see how the whole country lights up and how tweets don’t just move linearly up the coast as the storm progresses, capturing the advance impact of such a large storm and its peripheral effects across the country.” The team also did a “similar visualization of the recent US Presidential election showing the chaotic nature of political communication in the Twittersphere.”


To learn more about the project, I recommend watching Kalev’s 2-minute introductory video above.

What Percentage of Tweets Generated During a Crisis Are Relevant for Humanitarian Response?

More than half-a-million tweets were generated during the first three days of Hurricane Sandy and well over 400,000 pictures were shared via Instagram. Last year, over one million tweets were generated every five minutes on the day that Japan was struck by a devastating earthquake and tsunami. Humanitarian organizations are ill-equipped to manage this volume and velocity of information. In fact, the lack of analysis of this “Big Data” has spawned all kinds of suppositions about the perceived value—or lack thereof—that social media holds for emergency response operations. So just what percentage of tweets are relevant for humanitarian response?

One of the very few rigorous and data-driven studies that addresses this question is Dr. Sarah Vieweg’s 2012 doctoral dissertation on “Situational Awareness in Mass Emergency: Behavioral and Linguistic Analysis of Disaster Tweets.” After manually analyzing four distinct disaster datasets, Vieweg finds that only 8% to 20% of tweets generated during a crisis provide situational awareness. This implies that the vast majority of tweets generated during a crisis have zero added value vis-à-vis humanitarian response. So critics have good reason to be skeptical about the value of social media for disaster response.

At the same time, however, even if we take Vieweg’s lower bound estimate of 8%, this means that over 40,000 tweets generated during the first 72 hours of Hurricane Sandy may very well have provided increased situational awareness. In the case of Japan, more than 80,000 tweets generated every 5 minutes may have provided additional situational awareness. This volume of relevant information is much higher and more real-time than the information available to humanitarian responders via traditional channels.
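The arithmetic behind those figures, using the volumes quoted above:

```python
sandy_tweets = 500_000    # first ~72 hours of Hurricane Sandy
japan_tweets = 1_000_000  # per 5 minutes at the peak in Japan

print(0.08 * sandy_tweets)  # 40,000 potentially relevant tweets
print(0.08 * japan_tweets)  # 80,000 per 5-minute window
```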

Furthermore, preliminary research by QCRI’s Crisis Computing Team shows that 55.8% of 206,764 tweets generated during a major disaster last year were “Informative,” versus 22% that were “Personal” in nature. In addition, 19% of all tweets represented “Eye-Witness” accounts, 17.4% related to information about “Casualty/Damage,” 37.3% related to “Caution/Advice,” while 16.6% related to “Donations/Other Offers.” Incidentally, the tweets were automatically classified using algorithms developed by QCRI. The accuracy of these classifiers ranged from 75% to 81% in the case of the “Informative Classifier,” for example. A hybrid platform could then push tweets whose automatic classification is uncertain to a micro-tasking platform for manual classification, if need be.
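QCRI's classifiers are not published in this post, so the snippet below is only a generic illustration of the standard approach, supervised text classification over labeled tweets, including the confidence check that would route uncertain cases to human micro-taskers:

```python
# Generic supervised tweet classification; not QCRI's actual system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["Power lines down on Main St, avoid the area",
          "Thinking of everyone affected tonight",
          "Shelter at the high school is accepting donations"]
labels = ["Informative", "Personal", "Informative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, labels)

new_tweet = ["Bridge closed due to flooding"]
print(clf.predict(new_tweet))
# Low-confidence predictions go to human micro-taskers:
needs_human = clf.predict_proba(new_tweet).max() < 0.8
```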

This research at QCRI constitutes the first phase of our work to develop a Twitter Dashboard for the Humanitarian Cluster System, which you can read more about in this blog post. We are in the process of analyzing several other Twitter datasets in order to refine our automatic classifiers. I’ll be sure to share our preliminary observations and final analysis via this blog.

Crowdsourcing the Evaluation of Post-Sandy Building Damage Using Aerial Imagery

Update (Nov 2): 5,739 aerial images tagged by over 3,000 volunteers. Please keep up the outstanding work!

My colleague Schuyler Erle from Humanitarian OpenStreetMap just launched a very interesting effort in response to Hurricane Sandy. He shared the info below via CrisisMappers earlier this morning, which I’m turning into this blog post to help him recruit more volunteers.

Schuyler and team just got their hands on the Civil Air Patrol’s (CAP) super-high-resolution aerial imagery of the disaster-affected areas. They’ve imported this imagery into MapMill, their micro-tasking server created by Jeff Warren, and are now asking volunteers to help tag the images in terms of the damage depicted in each photo. “The 531 images on the site were taken from the air by CAP over New York, New Jersey, Rhode Island, and Massachusetts on 31 Oct 2012.”

To access this platform, simply click here: http://sandy.hotosm.org. If that link doesn’t work, please try sandy.locative.us.

“For each photo shown, please select ‘ok’ if no building or infrastructure damage is evident; please select ‘not ok’ if some damage or flooding is evident; and please select ‘bad’ if buildings etc. seem to be significantly damaged or underwater. Our *hope* is that the aggregation of the ok/not ok/bad ratings can be used to help guide FEMA resource deployment, or so was indicated might be the case during RELIEF at Camp Roberts this summer.”

A disaster response professional working in the affected areas for FEMA replied (via CrisisMappers) to Schuyler’s efforts to confirm that:

“[G]overnment agencies are working on exploiting satellite imagery for damage assessments and flood extents. The best way that you can help is to help categorize photos using the tool Schuyler provides [...].  CAP imagery is critical to our decision making as they are able to work around some of the limitations with satellite imagery so that we can get an area of where the worst damage is. Due to the size of this event there is an overwhelming amount of imagery coming in, your assistance will be greatly appreciated and truly aid in response efforts.  Thank you all for your willingness to help.”

Schuyler notes that volunteers can click on the Grid link from the home page of the Micro-Tasking platform to “zoom in to the coastlines of Massachusetts or New Jersey” and see “judgements about building damages beginning to aggregate in US National Grid cells, which is what FEMA use operationally. Again, the idea and intention is that, as volunteers judge the level of damage evident in each photo, the heat map will change color and indicate at a glance where the worst damage has occurred.” See above screenshot.
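The aggregation itself is simple vote-rolling: each photo's ok/not ok/bad ratings are scored and averaged within the US National Grid cell containing the photo, and the mean drives the cell's heat-map color. A sketch with an assumed scoring scheme (not necessarily MapMill's):

```python
from collections import defaultdict
from statistics import mean

SCORE = {"ok": 0, "not ok": 1, "bad": 2}  # assumed scoring
cell_ratings = defaultdict(list)

def add_rating(grid_cell, rating):
    # grid_cell: id of the US National Grid cell containing the photo.
    cell_ratings[grid_cell].append(SCORE[rating])

def cell_severity(grid_cell):
    # Higher mean rating = worse apparent damage = hotter map color.
    return mean(cell_ratings[grid_cell])

add_rating("18TWL89", "bad")      # hypothetical cell id
add_rating("18TWL89", "not ok")
print(cell_severity("18TWL89"))   # 1.5
```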

Even if you just spend 5 or 10 minutes tagging the imagery, this will still go a long way to supporting FEMA’s response efforts. You can also help by spreading the word and recruiting others to the cause. Thank you!