Tag Archives: Pulse

Social Media: Pulse of the Planet?

In 2010, Hillary Clinton described social media as a new nervous system for our planet (1). So can the pulse of the planet be captured with social media? Many are skeptical, not least because of the digital divide. “You mean the pulse of the Data Haves? The pulse of the affluent?” These rhetorical questions are perfectly justified, which is why social media should not be the sole source of information that feeds into decision-making for policy purposes. But millions are joining the social media ecosystem every day, so the selection bias is decreasing, not increasing. We may not yet be able to capture the pulse of the planet comprehensively and at a very high resolution, but the pulse of the majority world is certainly growing louder by the day.


This map of the world at night (based on 2011 data) reveals areas powered by electricity. Yes, Africa has far less electricity consumption. This is not misleading; it is an accurate proxy for industrial development (amongst other indexes). Does this data suffer from selection bias? Yes, the data is biased towards larger cities rather than the long tail. Does this render the data and map useless? Hardly. It all depends on what the question is.


What if our world was lit up by information instead of lightbulbs? The map above from TweetPing does just that. The website displays tweets in real-time as they’re posted across the world. Strictly speaking, the platform displays 10% of the ~340 million tweets posted each day (i.e., the “Decahose” rather than the “Firehose”). But the volume and velocity of the pulsing ten percent is already breathtaking.


One may think this picture depicts electricity use in Europe. Instead, this is a map of geo-located tweets (blue dots) and Flickr pictures (red dots). “White dots are locations that have been posted to both” (2). The number of active Twitter users grew an astounding 40% in 2012, making Twitter the fastest growing social network on the planet. Over 20% of the world’s internet population is now on Twitter (3). The Sightsmap below is a heat map based on the number of photographs submitted to Panoramio at different locations.


The map below depicts friendship ties on Facebook. It was generated using data from when there were “only” 500 million users, compared to today’s 1 billion+.


The following map does not depict electricity use in the US or the distribution of the population based on the most recent census data. Instead, this is a map of check-ins on Foursquare. What makes this map so powerful is not only that it was generated using 500 million check-ins but that “all those check-ins you see aren’t just single points—they’re links between all the other places people have been.”


TwitterBeat takes the (emotional) pulse of the planet by visualizing the Twitter Decahose in real-time using sentiment analysis. The crisis map in the YouTube video below comprises all tweets about Hurricane Sandy over time. “[Y]ou can see how the whole country lights up and how tweets don’t just move linearly up the coast as the storm progresses, capturing the advance impact of such a large storm and its peripheral effects across the country” (4).

These social media maps don’t only “work” at the country level or for Western industrialized states. Take the following map of Jakarta, made almost exclusively from geo-tagged tweets. You can see the individual roads and arteries (nervous system). Granted, this map works so well because of the horrendous traffic, but a pattern nevertheless emerges, one that is strongly correlated with Jakarta’s road network. And unlike the map of the world at night, we can capture this pulse in real time and at a fraction of the cost.


Like any young nervous system, our social media system is still growing and evolving. But it is already adding value. The analysis of tweets predicts the flu better than the crunching of traditional data used by public health institutions, for example. And the analysis of tweets from Indonesia also revealed that Twitter data can be used to monitor food security in real-time.

The main problem I see with all this has much less to do with issues of selection bias and unrepresentative samples. Far more problematic is the centralization of this data and the fact that it is closed data. Yes, the above maps are public, but don’t be fooled: the underlying data is not. In their new study, “The Politics of Twitter Data,” Cornelius Puschmann and Jean Burgess argue that the “owners” of social media data are the platform providers, not the end users. Yes, access to Twitter.com and Twitter’s API is free, but end users are limited to downloading just a few thousand tweets per day. (For comparative purposes, more than 20 million tweets were posted during Hurricane Sandy.) Getting access to more data can cost hundreds of thousands of dollars. In other words, as Puschmann and Burgess note, “only corporate actors and regulators—who possess both the intellectual and financial resources to succeed in this race—can afford to participate,” which means “that the emerging data market will be shaped according to their interests.”

“Social Media: Pulse of the Planet?” Getting there, but only a few elite Doctors can take the full pulse in real-time.

Big Data for Development: Challenges and Opportunities

The UN Global Pulse report on Big Data for Development ought to be required reading for anyone interested in humanitarian applications of Big Data. The purpose of this post is not to summarize this excellent 50-page document but to relay the most important insights contained therein. In addition, I question the motivation behind the unbalanced commentary on Haiti, which is my only major criticism of this otherwise authoritative report.

Real-time “does not always mean occurring immediately. Rather, “real-time” can be understood as information which is produced and made available in a relatively short and relevant period of time, and information which is made available within a timeframe that allows action to be taken in response i.e. creating a feedback loop. Importantly, it is the intrinsic time dimensionality of the data, and that of the feedback loop that jointly define its characteristic as real-time. (One could also add that the real-time nature of the data is ultimately contingent on the analysis being conducted in real-time, and by extension, where action is required, used in real-time).”

Data privacy “is the most sensitive issue, with conceptual, legal, and technological implications.” To be sure, “because privacy is a pillar of democracy, we must remain alert to the possibility that it might be compromised by the rise of new technologies, and put in place all necessary safeguards.” Privacy is defined by the International Telecommunication Union as “the right of individuals to control or influence what information related to them may be disclosed.” Moving forward, “these concerns must nurture and shape on-going debates around data privacy in the digital age in a constructive manner in order to devise strong principles and strict rules—backed by adequate tools and systems—to ensure ‘privacy-preserving analysis.’”

Non-representative data is often dismissed outright since findings based on such data cannot be generalized beyond that sample. “But while findings based on non-representative datasets need to be treated with caution, they are not valueless […].” Indeed, while the “sampling selection bias can clearly be a challenge, especially in regions or communities where technological penetration is low […], this does not mean that the data has no value. For one, data from ‘non-representative’ samples (such as mobile phone users) provide representative information about the sample itself—and do so in close to real time and on a potentially large and growing scale, such that the challenge will become less and less salient as technology spreads across and within developing countries.”

Perceptions rather than reality is what social media captures. Moreover, these perceptions can also be wrong. But only those individuals “who wrongfully assume that the data is an accurate picture of reality can be deceived. Furthermore, there are instances where wrong perceptions are precisely what is desirable to monitor because they might determine collective behaviors in ways that can have catastrophic effects.” In other words, “perceptions can also shape reality. Detecting and understanding perceptions quickly can help change outcomes.”

False data and hoaxes are part and parcel of user-generated content. “While the challenges around reliability and verifiability are real, some media organizations, such as the BBC, stand by the utility of citizen reporting of current events: ‘there are many brave people out there, and some of them are prolific bloggers and Tweeters. We should not ignore the real ones because we were fooled by a fake one.’ And have thus devised internal strategies to confirm the veracity of the information they receive and chose to report, offering an example of what can be done to mitigate the challenge of false information.” See for example my 20-page study on how to verify crowdsourced social media data, a field I refer to as information forensics. In any event, “whether false negatives are more or less problematic than false positives depends on what is being monitored, and why it is being monitored.”

“The United States Geological Survey (USGS) has developed a system that monitors Twitter for significant spikes in the volume of messages about earthquakes,” and as it turns out, 90% of user-generated reports that trigger an alert have turned out to be valid. “Similarly, a recent retrospective analysis of the 2010 cholera outbreak in Haiti conducted by researchers at Harvard Medical School and Children’s Hospital Boston demonstrated that mining Twitter and online news reports could have provided health officials a highly accurate indication of the actual spread of the disease with two weeks lead time.”

This leads to the other Haiti example raised in the report, namely the finding that SMS data was correlated with building damage. Please see my previous blog posts here and here for context. What the authors seem to overlook is that Benetech apparently did not submit their counter-findings for independent peer-review whereas the team at the European Commission’s Joint Research Center did—and the latter passed the peer-review process. Peer-review is how rigorous scientific work is validated. The fact that Benetech never submitted their blog post for peer-review is actually quite telling.

In sum, while this Big Data report is otherwise strong and balanced, I am really surprised that they cite a blog post as “evidence” while completely ignoring the JRC’s peer-reviewed scientific paper published in the Journal of the European Geosciences Union. Until counter-findings are submitted for peer review, the JRC’s results stand: unverified, non-representative crowdsourced text messages from the disaster-affected population in Port-au-Prince, which were translated from Haitian Creole to English via a novel crowdsourced volunteer effort, subsequently geo-referenced by hundreds of volunteers, and never subjected to any quality control, produced a statistically significant, positive correlation with building damage.

In conclusion, “any challenge with utilizing Big Data sources of information cannot be assessed divorced from the intended use of the information. These new, digital data sources may not be the best suited to conduct airtight scientific analysis, but they have a huge potential for a whole range of other applications that can greatly affect development outcomes.”

One such application is disaster response. Earlier this year, FEMA Administrator Craig Fugate gave a superb presentation on “Real Time Awareness” in which he relayed an example of how he and his team used Big Data (Twitter) during a series of devastating tornadoes in 2011:

“Mr. Fugate proposed dispatching relief supplies to the long list of locations immediately and received pushback from his team who were concerned that they did not yet have an accurate estimate of the level of damage. His challenge was to get the staff to understand that the priority should be one of changing outcomes, and thus even if half of the supplies dispatched were never used and sent back later, there would be no chance of reaching communities in need if they were in fact suffering tornado damage already, without getting trucks out immediately. He explained, “if you’re waiting to react to the aftermath of an event until you have a formal assessment, you’re going to lose 12-to-24 hours…Perhaps we shouldn’t be waiting for that. Perhaps we should make the assumption that if something bad happens, it’s bad. Speed in response is the most perishable commodity you have…We looked at social media as the public telling us enough information to suggest this was worse than we thought and to make decisions to spend [taxpayer] money to get moving without waiting for formal request, without waiting for assessments, without waiting to know how bad because we needed to change that outcome.”

“Fugate also emphasized that using social media as an information source isn’t a precise science and the response isn’t going to be precise either. “Disasters are like horseshoes, hand grenades and thermal nuclear devices, you just need to be close— preferably more than less.”

Twitter, Crises and Early Detection: Why “Small Data” Still Matters

My colleagues John Brownstein and Rumi Chunara at Harvard University’s HealthMap project are continuing to break new ground in the field of Digital Disease Detection. Using data obtained from tweets and online news, the team was able to identify a cholera outbreak in Haiti weeks before health officials acknowledged the problem publicly. Meanwhile, my colleagues from UN Global Pulse partnered with Crimson Hexagon to forecast food prices in Indonesia by carrying out sentiment analysis of tweets. I had actually written a blog post on Crimson Hexagon four years ago to explore how the platform could be used for early warning purposes, so I’m thrilled to see this potential realized.

There is a lot that intrigues me about the work that HealthMap and Global Pulse are doing. But one point that really struck me vis-a-vis the former is just how little data was necessary to identify the outbreak. To be sure, not many Haitians are on Twitter and my impression is that most humanitarians have not really taken to Twitter either (I’m not sure about the Haitian Diaspora). This would suggest that accurate, early detection is possible even without Big Data; even with “Small Data” that is neither representative nor indeed verified. (Interestingly, Rumi notes that the Haiti dataset is actually larger than datasets typically used for this kind of study.)

In related news, a recent peer-reviewed study by the European Commission found that the spatial distribution of crowdsourced text messages (SMS) following the earthquake in Haiti was strongly correlated with building damage. Again, the dataset of text messages was relatively small. And again, this data was neither collected using random sampling (i.e., it was crowdsourced) nor verified for accuracy. Yet the analysis of this small dataset still yielded some particularly interesting findings that have important implications for rapid damage detection in post-emergency contexts.
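To make the idea of "spatial correlation" concrete, here is a minimal sketch in Python. It is not the method used in the European Commission study, which is not described in this post; it simply bins two sets of points into grid cells and computes a Pearson correlation between the per-cell counts. The function names, the grid cell size, and the coordinates are all hypothetical and chosen purely for illustration.

```python
import math


def bin_points(points, cell_size):
    """Count (lat, lon) points per square grid cell of side cell_size degrees."""
    counts = {}
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] = counts.get(cell, 0) + 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spatial_correlation(sms_points, damage_points, cell_size=0.01):
    """Correlate per-cell SMS counts with per-cell damage counts.

    Cells that appear in only one of the two datasets count as zero
    in the other. Both inputs are lists of (lat, lon) tuples.
    """
    sms = bin_points(sms_points, cell_size)
    damage = bin_points(damage_points, cell_size)
    cells = sorted(set(sms) | set(damage))
    return pearson([sms.get(c, 0) for c in cells],
                   [damage.get(c, 0) for c in cells])
```

A value close to +1 would mean that cells with more text messages also tend to have more damaged buildings. Note that the result depends on the chosen cell size, and a simple coefficient like this says nothing about statistical significance.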

While I’m no expert in econometrics, what these studies suggest to me is that detecting change over time is ultimately more critical than having a large-N dataset, let alone one that is obtained via random sampling or even vetted for quality control purposes. That doesn’t mean that the latter factors are not important; it simply means that the outcome of the analysis is relatively less sensitive to these specific variables. Changes in the baseline volume/location of tweets on a given topic appear to be strongly correlated with offline dynamics.
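One simple way to picture "changes in the baseline volume" is a rolling z-score: compare each day's tweet count on a topic with the mean and standard deviation of the preceding days. The sketch below is an assumption-laden illustration, not the approach used by HealthMap, Global Pulse, or the USGS; the function name, window length, threshold, and sample counts are all hypothetical.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices where a count far exceeds its trailing baseline.

    counts    -- daily (or hourly) message counts on a topic
    window    -- number of preceding observations forming the baseline
    threshold -- how many standard deviations above the mean is a spike
    min_sigma -- floor on the standard deviation, so that a perfectly
                 flat baseline does not cause a division by zero
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = max(stdev(baseline), min_sigma)
        if (counts[i] - mu) / sigma >= threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily counts with one obvious jump on day 7.
daily_counts = [10, 12, 11, 9, 10, 11, 10, 60, 12, 11]
print(detect_spikes(daily_counts))
```

The point of the sketch is that the signal comes from the deviation relative to the recent past, not from the absolute size of the dataset, which is why even a small and unrepresentative sample can flag a change.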

What are the implications for crowdsourced crisis maps and disaster response? Could similar statistical analyses be carried out on Crowdmap data, for example? How small can a dataset be and still yield actionable findings like those mentioned in this blog post?