Tag Archives: analysis

Big Data for Development: From Information to Knowledge Societies?

Unlike analog information, “digital information inherently leaves a trace that can be analyzed (in real-time or later on).” But the “crux of the ‘Big Data’ paradigm is actually not the increasingly large amount of data itself, but its analysis for intelligent decision-making (in this sense, the term ‘Big Data Analysis’ would actually be more fitting than the term ‘Big Data’ by itself).” Martin Hilbert describes this as the “natural next step in the evolution from the ‘Information Age’ & ‘Information Societies’ to ‘Knowledge Societies’ [...].”

Hilbert has just published this study on the prospects of Big Data for international development. “From a macro-perspective, it is expected that Big Data informed decision-making will have a similar positive effect on efficiency and productivity as ICT have had during the recent decade.” Hilbert references a 2011 study that concluded the following: “firms that adopted Big Data Analysis have output and productivity that is 5–6% higher than what would be expected given their other investments and information technology usage.” Can these efficiency gains be brought to the unruly world of international development?

To answer this question, Hilbert introduces the above conceptual framework to “systematically review literature and empirical evidence related to the prerequisites, opportunities and threats of Big Data Analysis for international development.” Words, Locations, Nature and Behavior are types of data that are becoming increasingly available in large volumes.

“Analyzing comments, searches or online posts [i.e., Words] can produce nearly the same results for statistical inference as household surveys and polls.” For example, “the simple number of Google searches for the word ‘unemployment’ in the U.S. correlates very closely with actual unemployment data from the Bureau of Labor Statistics.” Hilbert argues that the tremendous volume of free textual data makes “the work and time-intensive need for statistical sampling seem almost obsolete.” But while the “large amount of data makes the sampling error irrelevant, this does not automatically make the sample representative.” 
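As a concrete illustration of the kind of correlation check described above, here is a minimal sketch in Python using pandas. The file and column names are hypothetical placeholders, not Hilbert's data:

```python
# Minimal sketch: correlate a monthly search-volume series with an official
# unemployment series. File and column names below are hypothetical.
import pandas as pd

searches = pd.read_csv("google_trends_unemployment.csv",  # hypothetical file
                       parse_dates=["month"], index_col="month")
bls = pd.read_csv("bls_unemployment_rate.csv",            # hypothetical file
                  parse_dates=["month"], index_col="month")

# Align the two monthly series, then compute the Pearson correlation.
joined = searches.join(bls, how="inner")
r = joined["search_volume"].corr(joined["unemployment_rate"])
print(f"Pearson r between search volume and unemployment rate: {r:.2f}")
```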

The increasing availability of Location data (via GPS-enabled mobile phones or RFIDs) needs no further explanation. Nature refers to data on natural processes such as temperature and rainfall. Behavior denotes activities that can be captured through digital means, such as user behavior in multiplayer online games or in economic affairs. But “studying digital traces might not automatically give us insights into offline dynamics. Besides these biases in the source, the data-cleaning process of unstructured Big Data frequently introduces additional subjectivity.”

The availability and analysis of Big Data is obviously limited in areas with scant access to tangible hardware infrastructure. This corresponds to the “Infrastructure” variable in Hilbert’s framework. “Generic Services” refers to the production, adoption and adaptation of software products, since these are a “key ingredient for a thriving Big Data environment.” In addition, the exploitation of Big Data also requires “data-savvy managers and analysts and deep analytical talent, as well as capabilities in machine learning and computer science.” This corresponds to “Capacities and Knowledge Skills” in the framework.

The third and final side of the framework represents the types of policies necessary to actualize the potential of Big Data for international development. These policies are divided into those that elicit Positive Feedback Loops, such as financial incentives, and those that create regulations, such as interoperability requirements, that is, Negative Feedback Loops.

The added value of Big Data Analytics is also dependent on the availability of publicly accessible data, i.e., Open Data. Hilbert estimates that a quarter of US government data could be used for Big Data Analysis if it were made available to the public. There is a clear return on investment in opening up this data. On average, governments with “more than 500 publicly available databases on their open data online portals have 2.5 times the per capita income, and 1.5 times more perceived transparency than their counterparts with less than 500 public databases.” The direction of “causality” here is questionable, however.

Hilbert concludes with a warning. The Big Data paradigm “inevitably creates a new dimension of the digital divide: a divide in the capacity to place the analytic treatment of data at the forefront of informed decision-making. This divide does not only refer to the availability of information, but to intelligent decision-making and therefore to a divide in (data-based) knowledge.” While the advent of Big Data Analysis is certainly not a panacea, “in a world where we desperately need further insights into development dynamics, Big Data Analysis can be an important tool to contribute to our understanding of and improve our contributions to manifold development challenges.”

I am troubled by the study’s assumption that we live in a Newtonian world of decision-making in which for every action there is an automatic equal and opposite reaction. The fact of the matter is that the vast majority of development policies and decisions are not based on empirical evidence. Indeed, rigorous evidence-based policy-making and interventions are still very much the exception rather than the rule in international development. Why? “Accountability is often the unhappy byproduct rather than desirable outcome of innovative analytics. Greater accountability makes people nervous” (Harvard 2013). Moreover, the response is always political. But Big Data Analysis runs the risk of de-politicizing a problem. As Alex de Waal noted over 15 years ago, “one universal tendency stands out: technical solutions are promoted at the expense of political ones.” I hinted at this concern when I first blogged about the UN Global Pulse back in 2009.

In sum, James Scott (one of my heroes) puts it best in his latest book:

“Applying scientific laws and quantitative measurement to most social problems would, modernists believed, eliminate the sterile debates once the ‘facts’ were known. [...] There are, on this account, facts (usually numerical) that require no interpretation. Reliance on such facts should reduce the destructive play of narratives, sentiment, prejudices, habits, hyperbole and emotion generally in public life. [...] Both the passions and the interests would be replaced by neutral, technical judgment. [...] This aspiration was seen as a new ‘civilizing project.’ The reformist, cerebral Progressives in early twentieth-century America and, oddly enough, Lenin as well believed that objective scientific knowledge would allow the ‘administration of things’ to largely replace politics. Their gospel of efficiency, technical training and engineering solutions implied a world directed by a trained, rational, and professional managerial elite. [...].”

“Beneath this appearance, of course, cost-benefit analysis is deeply political. Its politics are buried deep in the techniques [...] how to measure it, in what scale to use, [...] in how observations are translated into numerical values, and in how these numerical values are used in decision making. While fending off charges of bias or favoritism, such techniques [...] succeed brilliantly in entrenching a political agenda at the level of procedures and conventions of calculation that is doubly opaque and inaccessible. [...] Charged with bias, the official can claim, with some truth, that ‘I am just cranking the handle’ of a nonpolitical decision-making machine.”

See also:

  • Big Data for Development: Challenges and Opportunities [Link]
  • Beware the Big Errors of Big Data (by Nassim Taleb) [Link]
  • How to Build Resilience Through Big Data [Link]

Social Network Analysis for Digital Humanitarian Response

Monitoring social media for digital humanitarian response can be a massive undertaking. The sheer volume and velocity of tweets generated during a disaster make real-time social media monitoring particularly challenging, if not nearly impossible. However, two new studies argue that there is “a better way to track the spread of information on Twitter that is much more powerful.”


Manuel Garcia-Herranz and his team at the Autonomous University of Madrid in Spain use small groups of “highly connected Twitter users as ‘sensors’ to detect the emergence of new ideas. They point out that this works because highly connected individuals are more likely to receive new ideas before ordinary users.” To test their hypothesis, the team studied 40 million Twitter users who “together totted up 1.5 billion ‘follows’ and sent nearly half a billion tweets, including 67 million containing hashtags.”

They found that small groups of highly connected Twitter users detect “new hashtags about seven days earlier than the control group.  In fact, the lead time varied between nothing at all and as much as 20 days.” Manuel and his team thus argue that “there’s no point in crunching these huge data sets. You’re far better off picking a decent sensor group and watching them instead.” In other words, “your friends could act as an early warning system, not just for gossip, but for civil unrest and even outbreaks of disease.”
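To make the sensor-group idea concrete, here is a hedged sketch in Python (networkx) that picks the most-followed users as sensors and measures their lead time over a random control group. The data structures are illustrative assumptions, not the authors' code:

```python
# Toy version of the "sensor group" approach: compare when a hashtag first
# appears among highly connected users versus a random control group.
import random
import networkx as nx

def sensor_lead_time(follow_graph: nx.DiGraph, first_use: dict, k: int = 100):
    """first_use maps user -> time of that user's first tweet with the hashtag."""
    # Sensors: the k users with the most followers (highest in-degree).
    sensors = sorted(follow_graph, key=follow_graph.in_degree, reverse=True)[:k]
    control = random.sample(list(follow_graph), k)

    def earliest(group):
        times = [first_use[u] for u in group if u in first_use]
        return min(times) if times else None

    t_sensors, t_control = earliest(sensors), earliest(control)
    if t_sensors is None or t_control is None:
        return None  # the hashtag never reached one of the groups
    return t_control - t_sensors  # positive => sensors saw the hashtag earlier
```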

The second study, “Identifying and Characterizing User Communities on Twitter during Crisis Events” (PDF), is authored by Aditi Gupta et al. Aditi and her colleagues analyzed three major crisis events (Hurricane Irene, the riots in England and the earthquake in Virginia) to “identify the different user communities, and characterize them by the top central users.” Their findings are in line with those shared by the team in Madrid. “[T]he top users represent the topics and opinions of all the users in the community with 81% accuracy on an average.” In sum, “to understand a community, we need to monitor and analyze only these top users rather than all the users in a community.”
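A sketch of what that community analysis could look like, again with networkx. The algorithm choices here (greedy modularity for communities, degree centrality for the "top users") are stand-ins, not necessarily the exact methods used by Gupta et al.:

```python
# Detect communities in an interaction graph, then keep only each community's
# most central members as the users worth monitoring.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def top_users_per_community(graph: nx.Graph, n_top: int = 10):
    centrality = nx.degree_centrality(graph)
    summaries = []
    for community in greedy_modularity_communities(graph):
        top = sorted(community, key=centrality.get, reverse=True)[:n_top]
        summaries.append(top)  # monitor these instead of the whole community
    return summaries
```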

How could these findings be used to prioritize the monitoring of social media during disasters? See this blog post for more on the use of social network analysis (SNA) for humanitarian response.

Social Network Analysis of Tweets During Australia Floods

This study (PDF) analyzes the community of Twitter users who disseminated  information during the crisis caused by the Australian floods in 2010-2011. “In times of mass emergencies, a phenomenon known as collective behavior becomes apparent. It consists of socio-behaviors that include intensified information search and information contagion.” The purpose of the Australian floods analysis is to reveal interesting patterns and features of this online community using social network analysis (SNA).

The authors analyzed 7,500 flood-related tweets to understand which users did the tweeting and retweeting. This was done to create nodes and links for SNA, which was able to “identify influential members of the online communities that emerged during the Queensland, NSW and Victorian floods as well as identify important resources being referred to. The most active community was in Queensland, possibly induced by the fact that the floods were orders of magnitude greater than in NSW and Victoria.”
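As an illustration of how such nodes and links can be derived from raw tweets, here is a minimal sketch (the tweet field names are assumptions): each retweet becomes a directed edge, and influential members fall out of a centrality measure such as PageRank.

```python
# Build a directed retweet graph: an edge points from the retweeter to the
# author being retweeted. Field names on the tweet records are assumed.
import networkx as nx

def build_retweet_graph(tweets):
    g = nx.DiGraph()
    for t in tweets:
        if t.get("retweeted_user"):
            g.add_edge(t["user"], t["retweeted_user"])
    return g

def influential_members(g: nx.DiGraph, n: int = 20):
    scores = nx.pagerank(g)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```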

The analysis also confirmed “the active part taken by local authorities, namely Queensland Police, government officials and volunteers. On the other hand, there was not much activity from local authorities in the NSW and Victorian floods prompting for the greater use of social media by the authorities concerned. As far as the online resources suggested by users are concerned, no sensible conclusion can be drawn as important ones identified were more of a general nature rather than critical information. This might be comprehensible as it was past the impact stage in the Queensland floods and participation was at much lower levels in the NSW and Victorian floods.”

Social Network Analysis is an under-utilized methodology for the analysis of communication flows during humanitarian crises. Understanding the topology of a social network is key to information diffusion. Think of this as a virus infecting a network: if we want to “infect” a social network with important crisis information as quickly and fully as possible, understanding the network’s topology is a prerequisite, and social network analysis is the tool for mapping it.
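The virus analogy can be made concrete with a toy independent-cascade simulation; the propagation probability p below is arbitrary, and the whole thing is an illustration rather than a validated diffusion model:

```python
# Seed a few nodes with the crisis message and simulate how far it spreads,
# given the network topology and a fixed per-edge transmission probability.
import random
import networkx as nx

def simulate_spread(graph: nx.Graph, seeds, p: float = 0.1):
    informed, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for neighbor in graph.neighbors(node):
            if neighbor not in informed and random.random() < p:
                informed.add(neighbor)
                frontier.append(neighbor)
    return informed  # everyone the message eventually reached
```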

Tweeting is Believing? Analyzing Perceptions of Credibility on Twitter

What factors influence whether or not a tweet is perceived as credible? According to this recent study, users have “difficulty discerning truthfulness based on content alone, with message topic, user name, and user image all impacting judgments of tweets and authors to varying degrees regardless of the actual truthfulness of the item.”

For example, “Features associated with low credibility perceptions were the use of non-standard grammar and punctuation, not replacing the default account image, or using a cartoon or avatar as an account image. Following a large number of users was also associated with lower author credibility, especially when unbalanced in comparison to follower count [...].” As for features enhancing a tweet’s credibility, these included “author influence (as measured by follower, retweet, and mention counts), topical expertise (as established through a Twitter homepage bio, history of on-topic tweeting, pages outside of Twitter, or having a location relevant to the topic of the tweet), and reputation (whether an author is someone a user follows, has heard of, or who has an official Twitter account verification seal). Content related features viewed as credibility-enhancing were containing a URL leading to a high-quality site, and the existence of other tweets conveying similar information.”

In general, users’ ability to “judge credibility in practice is largely limited to those features visible at-a-glance in current UIs (user picture, user name, and tweet content). Conversely, features that often are obscured in the user interface, such as the bio of a user, receive little attention despite their ability to impact credibility judgments.” The study includes a table comparing each feature’s perceived credibility impact with the attention users actually allot to assessing that feature.

“Message topic influenced perceptions of tweet credibility, with science tweets receiving a higher mean tweet credibility rating than those about either politics or entertainment. Message topic had no statistically significant impact on perceptions of author credibility.” In terms of usernames, “Authors with topical names were considered more credible than those with traditional user names, who were in turn considered more credible than those with internet name styles.” In a follow-up experiment, the study analyzed perceptions of credibility vis-a-vis a user’s image, i.e., the profile picture associated with a given Twitter account. “Use of the default Twitter icon significantly lowers ratings of content and marginally lowers ratings of authors [...]” in comparison to generic, topical, female and male images.

Obviously, “many of these metrics can be faked to varying extents. Selecting a topical username is trivial for a spam account. Manufacturing a high follower to following ratio or a high number of retweets is more difficult but not impossible. User interface changes that highlight harder to fake factors, such as showing any available relationship between a user’s network and the content in question, should help.” Overall, these results “indicate a discrepancy between features people rate as relevant to determining credibility and those that mainstream social search engines make available.” The authors of the study conclude by suggesting changes in interface design that will enhance a user’s ability to make credibility judgements.

“Firstly, author credentials should be accessible at a glance, since these add value and users rarely take the time to click through to them. Ideally this will include metrics that convey consistency (number of tweets on topic) and legitimization by other users (number of mentions or retweets), as well as details from the author’s Twitter page (bio, location, follower/following counts). Second, for content assessment, metrics on number of retweets or number of times a link has been shared, along with who is retweeting and sharing, will provide consumers with context for assessing credibility. [...] seeing clusters of tweets that conveyed similar messages was reassuring to users; displaying such similar clusters runs counter to the current tendency for search engines to strive for high recall by showing a diverse array of retrieved items rather than many similar ones–exploring how to resolve this tension is an interesting area for future work.”

In sum, the above findings and recommendations explain why platforms such as Rapportive, Seriously Rapid Source Review (SRSR) and CrisisTracker add so much value to the process of assessing the credibility of tweets in near real-time. For related research, see Predicting the Credibility of Disaster Tweets Automatically and Automatically Ranking the Credibility of Tweets During Major Events.

Analyzing Tweets From Australia’s Worst Bushfires

As many as 400 fires were identified in Victoria on February 7, 2009. These resulted in Australia’s highest ever loss of life from a bushfire; 173 people were killed and over 400 injured. This analysis of 1,684 tweets generated during these fires found that they were “laden with actionable factual information which contrasts with earlier claims that tweets are of no value made of mere random personal notes.”

Of the 705 unique users who exchanged tweets during the fires, only two could be considered “official sources of communication”; both accounts were held by ABC Radio Melbourne. “This demonstrates the lack of state or government based initiatives to use social media tools for official communication purposes. Perhaps the growth in Twitter usage for political campaigns will force policy makers to reconsider.” In any event, about 65% of the tweets had “factual details,” i.e., “more than three of every five tweets had useful information.” In addition, “Almost 22% of the tweets had geographical data thus identifying location of the incident which is critical in crisis reporting.” Around 7% of the tweets were seeking information, help or answers. Finally, close to 5% (about 80 tweets) were “directly actionable.”

While 5% is obviously low, there’s no reason why this figure has to remain this low. If humanitarian organizations were to create demand for posting actionable information on Twitter, this would likely increase the supply of more actionable content. Take for example the pro-active role taken by the Philippines Government vis-a-vis the use of Twitter for disaster response. In any case, the findings from the above study do reveal that 65% of tweets had useful information. Surely contacting the publishers of those tweets could produce even more directly actionable content—which is why the BBC’s User-Generated Content Hub (UGC) uses follow-up as a strategy to verify content posted on social media.

Finally, keep in mind that calls to emergency numbers like “911” in the US and “000” in Australia are not spontaneously actionable. That is, human operators who handle these emergency calls ask a series of detailed questions in order to turn the information into structured, actionable content. Some of these standard questions are: What is your emergency? What is your current location? What is your phone number? What is happening? When did the incident occur? Are there injuries? In other words, without being prompted with specific questions, callers are unlikely to provide as much actionable information. The same is true for the use of Twitter in crisis response.

 

Automatically Ranking the Credibility of Tweets During Major Events

In their study, “Credibility Ranking of Tweets during High Impact Events,” authors Aditi Gupta and Ponnurangam Kumaraguru “analyzed the credibility of information in tweets corresponding to fourteen high impact news events of 2011 around the globe.” According to their analysis, “30% of total tweets  about an event contained situational information about the event while 14% was spam.” In addition, about 17% of total tweets contained situational awareness information that was credible.


The study analyzed over 35 million tweets posted by ~8 million users based on current trending topics. From this data, the authors identified 14 major events reflected in the tweets. These included the UK riots, Libya crisis, Virginia earthquake and Hurricane Irene, for example.

“Using regression analysis, we identified the important content and source based features, which can predict the credibility of information in a tweet. Prominent content based features were number of unique characters, swear words, pronouns, and emoticons in a tweet, and user based features like the number of followers and length of username. We adopted a supervised machine learning and relevance feedback approach using the above features, to rank tweets according to their credibility score. The performance of our ranking algorithm significantly enhanced when we applied re-ranking strategy. Results show that extraction of credible information from Twitter can be automated with high confidence.”
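As a rough, assumption-laden sketch of this kind of pipeline (not the authors' implementation), the following Python extracts a few of the features named above and ranks unlabeled tweets by a logistic-regression credibility score. The swear-word lexicon and the tweet record fields are placeholders:

```python
from sklearn.linear_model import LogisticRegression

SWEAR_WORDS = {"damn", "hell"}  # placeholder lexicon, not the authors' list

def features(tweet: dict) -> list:
    """Approximations of the content- and source-based features named above."""
    text = tweet["text"]
    words = text.lower().split()
    return [
        len(set(text)),                                  # unique characters
        sum(w in SWEAR_WORDS for w in words),            # swear words
        sum(w in {"i", "you", "he", "she", "we", "they"} for w in words),
        text.count(":)") + text.count(":("),             # crude emoticon count
        tweet["followers"],                              # source-based feature
        len(tweet["username"]),                          # length of username
    ]

def rank_by_credibility(labeled, unlabeled):
    """Train on tweets with a 0/1 'credible' label, then score the rest."""
    model = LogisticRegression().fit(
        [features(t) for t in labeled], [t["credible"] for t in labeled])
    scores = model.predict_proba([features(t) for t in unlabeled])[:, 1]
    return sorted(zip(scores, unlabeled), key=lambda pair: -pair[0])
```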

The paper is available here (PDF). For more applied research on “information forensics,” please see this link.

How the UN Used Social Media in Response to Typhoon Pablo (Updated)

Our mission as digital humanitarians was to deliver a detailed dataset of pictures and videos (posted on Twitter) which depict damage and flooding following the Typhoon. An overview of this digital response is available here. The task of our United Nations colleagues at the Office for the Coordination of Humanitarian Affairs (OCHA) was to rapidly consolidate and analyze our data to compile a customized Situation Report for OCHA’s team in the Philippines. The maps, charts and figures below are taken from this official report.

[Map: Typhoon Pablo Social Media Mapping, OCHA, 6 December 2012]

This map is the first ever official UN crisis map entirely based on data collected from social media. Note the “Map data sources” at the bottom left of the map: “The Digital Humanitarian Network’s Solution Team: Standby Volunteer Task Force (SBTF) and Humanity Road (HR).” In addition to several UN agencies, the government of the Philippines has also made use of this information.


The cleaned data was subsequently added to this Google Map and also made public on the official Google Crisis Map of the Philippines.


One of my main priorities now is to make sure we do a far better job at leveraging advanced computing and microtasking platforms so that we are better prepared the next time we’re asked to repeat this kind of deployment. On the advanced computing side, it should be perfectly feasible to develop an automated way to crawl Twitter and identify links to images and videos. My colleagues at QCRI are already looking into this. As for microtasking, I am collaborating with PyBossa and Crowdflower to ensure that we have highly customizable platforms on stand-by so we can immediately upload the results of QCRI’s algorithms. In sum, we have got to move beyond simple crowdsourcing and adopt more agile micro-tasking and social computing platforms as both are far more scalable.
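For a sense of what such an automated filter might look like, here is a minimal sketch that flags tweets whose links appear to point at images or videos. The domains and file extensions are heuristic guesses on my part, not QCRI's actual method:

```python
# Keep only tweets whose URLs look like they lead to photos or videos.
import re

MEDIA_HINTS = (".jpg", ".jpeg", ".png", ".gif", ".mp4",
               "instagram.com", "twitpic.com", "youtube.com", "youtu.be")

def has_media_link(tweet_text: str) -> bool:
    for url in re.findall(r"https?://\S+", tweet_text):
        if any(hint in url.lower() for hint in MEDIA_HINTS):
            return True
    return False
```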

In the meantime, a big big thanks once again to all our digital volunteers who made this entire effort possible and highly insightful.

Summary: Digital Disaster Response to Philippine Typhoon

Update: How the UN Used Social Media in Response to Typhoon Pablo

The United Nations Office for the Coordination of Humanitarian Affairs (OCHA) activated the Digital Humanitarian Network (DHN) on December 5th at 3pm Geneva time (9am New York). The activation request? To collect all relevant tweets about Typhoon Pablo posted on December 4th and 5th; identify pictures and videos of damage/flooding shared in those tweets; geo-locate, time-stamp and categorize this content. The UN requested that this database be shared with them by 5am Geneva time the following day. As per DHN protocol, the activation request was reviewed within an hour. The UN was informed that the request had been granted and that the DHN was formally activated at 4pm Geneva.


The DHN is composed of several members who form Solution Teams when the network is activated. The purpose of Digital Humanitarians is to support humanitarian organizations in their disaster response efforts around the world. Given the nature of the UN’s request, both the Standby Volunteer Task Force (SBTF) and Humanity Road (HR) joined the Solution Team. HR focused on analyzing all tweets posted December 4th while the SBTF worked on tweets posted December 5th. Over 20,000 tweets were analyzed. As HR will have a blog post describing their efforts shortly (please check here), I will focus on the SBTF.


The Task Force first used Geofeedia to identify all relevant pictures/videos that were already geo-tagged by users. About a dozen were identified in this manner. Meanwhile, the SBTF partnered with the Qatar Foundation Computing Research Institute’s (QCRI) Crisis Computing Team to collect all tweets posted on December 5th with the hashtags endorsed by the Philippine Government. QCRI ran algorithms on the dataset to remove (1) all retweets and (2) all tweets without links (URLs). Given the very short turn-around time requested by the UN, the SBTF & QCRI Teams elected to take a two-pronged approach in the hopes that one, at least, would be successful.
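The two filtering steps QCRI ran (removing retweets and link-less tweets) are simple enough to sketch in a few lines of Python. This is an illustration that assumes tweet records with a "text" field, not QCRI's actual code:

```python
# Step 1: drop retweets. Step 2: drop tweets that contain no URL.
import re

def filter_tweets(tweets):
    kept = []
    for t in tweets:
        if t["text"].lower().startswith("rt @"):       # (1) remove retweets
            continue
        if not re.search(r"https?://\S+", t["text"]):  # (2) remove link-less tweets
            continue
        kept.append(t)
    return kept
```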

The first approach used Crowdflower (CF), introduced here. Workers on Crowdflower were asked to check each Tweet’s URL and determine whether it linked to a picture or video. The purpose was to filter out URLs that linked to news articles. CF workers were also asked to assess whether the tweets (or pictures/videos) provided sufficient geographic information for them to be mapped. This methodology worked for about 2/3 of all the tweets in the database. A review of lessons learned and how to use Crowdflower for disaster response will be posted in the future.


The second approach was made possible thanks to a partnership with PyBossa, a free, open-source crowdsourcing and micro-tasking platform. This effort is described here in more detail. While we are still reviewing the results of this approach, we expect that  this tool will become the standard for future activations of the Digital Humanitarian Network. I will thus continue working closely with the PyBossa team to set up a standby PyBossa platform ready-for-use at a moment’s notice so that Digital Humanitarians can be fully prepared for the next activation.

Now for the results of the activation. Within 10 hours, over 20,000 tweets were analyzed using a mix of methodologies. By 4.30am Geneva time, the combined efforts of HR and the SBTF resulted in a database of 138 highly annotated tweets. The following meta-data was collected for each tweet (a minimal schema sketch follows the list):

  • Media Type (Photo or Video)
  • Type of Damage (e.g., large-scale housing damage)
  • Analysis of Damage (e.g., 5 houses flooded, 1 damaged roof)
  • GPS coordinates (latitude/longitude)
  • Province
  • Region
  • Date
  • Link to Photo or Video
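One way to represent such a record in code, purely as an illustration (the volunteers' actual schema is not specified here):

```python
# Hypothetical record type mirroring the meta-data fields listed above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedTweet:
    media_type: str              # "photo" or "video"
    damage_type: str             # e.g., "large-scale housing damage"
    damage_analysis: str         # e.g., "5 houses flooded, 1 damaged roof"
    latitude: Optional[float]    # GPS coordinates, when available
    longitude: Optional[float]
    province: str
    region: str
    date: str
    media_link: str              # link to the photo or video
```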

The vast majority of curated tweets had latitude and longitude coordinates. One SBTF volunteer (“Mapster”) created this map below to plot the data collected. Another Mapster created a similar map, which is available here.

[Map: Typhoon Pablo crisis map of Twitter multimedia]

The completed database was shared with UN OCHA at 4.55am Geneva time. Our humanitarian colleagues are now in the process of analyzing the data collected and writing up a final report, which they will share with OCHA Philippines today by 5pm Geneva time.

Needless to say, we all learned a lot thanks to the deployment of the Digital Humanitarian Network in the Philippines. This was the first time we were activated to carry out a task of this type. We are now actively reviewing our combined efforts with the concerted aim of streamlining our workflows and methodologies to make this type of effort far easier and quicker to complete in the future. If you have suggestions and/or technologies that could facilitate this kind of digital humanitarian work, then please do get in touch either by posting your ideas in the comments section below or by sending me an email.

Lastly, but definitely most importantly, a big HUGE thanks to everyone who volunteered their time to support the UN’s disaster response efforts in the Philippines at such short notice! We want to publicly recognize everyone who came to the rescue, so here’s a list of volunteers who contributed their time (more to be added!). Without you, there would be no database to share with the UN, no learning, no innovating and no demonstration that digital volunteers can and do make a difference. Thank you for caring. Thank you for daring.

Digital Humanitarian Response to Typhoon Pablo in Philippines

Update: Please help the UN! Tag tweets to support disaster response!

The purpose of this post is to keep notes on our efforts to date with the aim of revisiting these at a later time to write a more polished blog post on said efforts. By “Digital Humanitarian Response” I mean the process of using digital technologies to aid disaster response efforts.


My colleagues and I at QCRI have been collecting disaster-related tweets on Typhoon Pablo since Monday. More specifically, we’ve been collecting those tweets with the hashtags officially endorsed by the government. There were over 13,000 relevant tweets posted on Tuesday alone. We then paid Crowdflower workers to micro-task the tagging of these hash-tagged tweets based on the following categories:

[Screenshot: Crowdflower tagging categories]

Several hundred tweets were processed during the first hour. On average, about 750 tweets were processed per hour. Clearly, we’d want that number to be far higher (hence the need to combine micro-tasking with automated algorithms, as explained in the presentation below). In any event, the micro-tasking could also be accelerated if we increased the pay to Crowdflower workers. As it is, the total cost for processing the 13,000+ tweets came to about $250.

The database of processed tweets was then shared (every couple of hours) with the Standby Volunteer Task Force (SBTF). SBTF volunteers (“Mapsters”) only focused on tweets that had been geo-tagged and tagged as relevant (e.g., “Casualties,” “Infrastructure Damage,” “Needs/Asks,” etc.) by Crowdflower workers. SBTF volunteers then mapped these tweets on a Crowdmap as part of a training exercise for new Mapsters.


We’re now talking with a humanitarian colleague in the Philippines who asked whether we can identify pictures/videos shared on social media that show damage, bridges down, flooding, etc. The catch is that these need to have a  location and time/date for them to be actionable. So I went on Geofeedia and scraped the relevant content available there (which Mapsters then added to the Crowdmap). One constraint of Geofeedia (and many other such platforms), however, is that they only map content that has been geo-tagged by users posting said content. This means we may be missing the majority of relevant content.

So my colleagues at QCRI are currently pulling all tweets posted today (Wednesday) and running an automated algorithm to identify tweets with URLs/links. We’ll ask Crowdflower workers to process the most recent tweets (and work backwards) by tagging those that: (1) link to pictures/video of damage/flooding, and (2) have geographic information. The plan is to have Mapsters add those tweets to the Crowdmap and to share the latter with our humanitarian colleague in the Philippines.

There are several parts of the above workflows that can (and will) be improved. I for one have already learned a lot just from the past 24 hours. But this is the subject of a future blog post as I need to get back to the work at hand.

Sentiment Analysis of #COP18 Tweets from the UN Climate Conference

The Qatar Foundation’s Computing Research Institute (QCRI) has just launched a live sentiment analysis tool of all #COP18 tweets being posted during the United Nations (UN) Climate Change Conference in Doha, Qatar. The event kicked off on Monday, November 26th and will conclude on Friday, December 7th. While the world’s media is actively covering COP18, social media reports are equally insightful. This explains the rationale behind QCRI’s Live #COP18 Twitter Sentiment Analysis Tool.

[Screenshot: QCRI Live #COP18 Twitter Sentiment Analysis timelines]

The first timeline displays the number of positive versus negative tweets posted with the COP18 hashtag. The tweets are automatically tagged as positive or negative using the SentiStrength algorithm, which has the same level of accuracy as a person manually tagging the tweets. The second timeline simply depicts the average sentiment of #COP18 tweets. Both graphs are automatically updated every hour. Note that tweets in all languages are analyzed, not just English-language tweets.
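For the curious, the hourly aggregation behind timelines like these can be sketched as follows, treating SentiStrength as a black box that assigns each tweet a numeric score (positive score = positive sentiment). The DataFrame layout is an assumption, not QCRI's pipeline:

```python
# Bucket scored tweets by hour and count positives/negatives, plus the mean.
import pandas as pd

def hourly_sentiment(tweets: pd.DataFrame) -> pd.DataFrame:
    """Expects a datetime index and a numeric 'score' column (assumed)."""
    return pd.DataFrame({
        "positive": (tweets["score"] > 0).resample("1H").sum(),
        "negative": (tweets["score"] < 0).resample("1H").sum(),
        "avg_sentiment": tweets["score"].resample("1H").mean(),
    })
```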

These timelines enable journalists, activists and others to monitor the general mood and reaction to presentations, announcements & conversations happening at the UN Climate Conference. For example, we see a major spike in positive tweets (and to a lesser extent negative tweets) between 10am-11am on November 26th. This is when the Opening Ceremony kicks off, as can be seen from the conference agenda.


The next highest peak occurs between 6pm-7pm on the 27th, which corresponds to the opening plenary of the Ad Hoc Working Group on the Durban Platform for Enhanced Action (ADP). This group is tasked with establishing an agreement that will legally bind all parties to climate targets for the first time. The tweets are primarily positive, which may reflect a positive start to negotiations on operationalizing the Durban Platform. This news article appears to support this hypothesis. At 2pm on November 28th, the number of positive and negative tweets both peak at approximately the same number, 160 tweets. Twitter users may be evenly divided on a topic being discussed.


To find out more, simply scroll to the right of the timelines. You’ll see two Twitter streams displayed. The first provides a list of selected positive and negative tweets. More specifically, the most frequently retweeted positive and negative tweets for each day are displayed. This feature enables users to understand how some tweets are driving the sentiment analyses displayed on the timelines. The second Twitter stream displays the most recent tweets on the UN Conference.

If you’re interested in displaying these live graphs on your website, simply click on the “Embed link” to grab the code. The code is free; we simply ask that you credit and link to QCRI. If you analyze #COP18 tweets using these timelines, please let us know so we can benefit from your insights during this pivotal conference. The sentiment analysis dashboard was put together by QCRI’s Sofiane Abbar, Walid Magdy and myself. We welcome your feedback on how to make this dashboard more useful for future conferences and events. Please note that this site was put together “overnight”; i.e., it was rushed. As such it is only an initial prototype.