Twitter, Crises and Early Detection: Why “Small Data” Still Matters

My colleagues John Brownstein and Rumi Chunara at Harvard University’s HealthMap project are continuing to break new ground in the field of Digital Disease Detection. Using data obtained from tweets and online news, the team was able to identify a cholera outbreak in Haiti weeks before health officials acknowledged the problem publicly. Meanwhile, my colleagues from UN Global Pulse partnered with Crimson Hexagon to forecast food prices in Indonesia by carrying out sentiment analysis of tweets. I wrote a blog post on Crimson Hexagon four years ago to explore how the platform could be used for early warning purposes, so I’m thrilled to see this potential realized.

There is a lot that intrigues me about the work that HealthMap and Global Pulse are doing. But one point that really struck me vis-a-vis the former is just how little data was necessary to identify the outbreak. To be sure, not many Haitians are on Twitter, and my impression is that most humanitarians have not really taken to Twitter either (I’m not sure about the Haitian Diaspora). This would suggest that accurate, early detection is possible even without Big Data; even with “Small Data” that is neither representative nor verified. (Interestingly, Rumi notes that the Haiti dataset is actually larger than datasets typically used for this kind of study.)

In related news, a recent peer-reviewed study by the European Commission found that the spatial distribution of crowdsourced text messages (SMS) following the earthquake in Haiti was strongly correlated with building damage. Again, the dataset of text messages was relatively small. And again, this data was neither collected using random sampling (i.e., it was crowdsourced) nor verified for accuracy. Yet the analysis of this small dataset still yielded some particularly interesting findings that have important implications for rapid damage detection in post-emergency contexts.

While I’m no expert in econometrics, what these studies suggest to me is that detecting change over time is ultimately more critical than having a large-N dataset, let alone one that is obtained via random sampling or even vetted for quality control purposes. That doesn’t mean that the latter factors are not important; it simply means that the outcome of the analysis is relatively less sensitive to these specific variables. Changes in the baseline volume or location of tweets on a given topic appear to be strongly correlated with offline dynamics.
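To make the idea of change against a baseline concrete, here is a minimal sketch in Python. It is not the method HealthMap or Global Pulse used. It assumes a trailing seven-day baseline and a simple z-score rule, and the function name, the threshold and the daily counts are all hypothetical and invented for illustration.

```python
from statistics import mean, stdev


def flag_volume_changes(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose count sits far above the trailing baseline."""
    flagged = []
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]
        mu = mean(baseline)
        # Floor sigma at one count so a flat or nearly flat baseline
        # does not inflate the z-score (an assumption of this sketch).
        sigma = max(stdev(baseline), 1.0)
        z = (daily_counts[day] - mu) / sigma
        if z >= threshold:
            flagged.append(day)
    return flagged


# Hypothetical daily counts of tweets on one topic (invented, not real data).
counts = [4, 6, 5, 3, 5, 4, 6, 5, 4, 21, 30, 28]
print(flag_volume_changes(counts))
```

Note that the dataset here is tiny and unverified, yet the jump on the tenth day still stands out, because the rule only compares each day with the days immediately before it. Once the spike enters the baseline window the rule stops firing, which is why this kind of check suits early detection rather than tracking an ongoing event.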

What are the implications for crowdsourced crisis maps and disaster response? Could similar statistical analyses be carried out on Crowdmap data, for example? How small can a dataset be and still yield actionable findings like those mentioned in this blog post?

The Mathematics of War: On Earthquakes and Conflicts

A conversation with my colleague Sinan Aral at PopTech 2011 reminded me of some earlier research I had carried out on the mathematics of war. So this is a good time to share some of the findings from this research. The story begins some 60 years ago, when British physicist Lewis Fry Richardson found that international wars follow what is called a power law distribution. A power law distribution relates the frequency and “magnitude” of events. For example, the Gutenberg-Richter law relates the size of earthquakes, as measured on the Richter scale, to their frequency. Richardson found that the frequency of international wars and the number of casualties each produced followed a power law.
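For readers who want to see what a power law looks like in practice, here is a minimal Python sketch. It assumes a continuous power law in which the probability of an event of size x falls off as x to the power of minus alpha above some minimum size. The sketch draws synthetic event sizes and recovers alpha with the standard maximum-likelihood estimator. The function names are hypothetical, the data is simulated rather than taken from any conflict dataset, and alpha = 2.5 is simply an illustrative value.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from p(x) ~ x**(-alpha), x >= x_min, by inverse transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def estimate_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law above x_min."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / x_min) for v in tail)


# Synthetic "event sizes" (simulated, not real casualty data).
rng = random.Random(42)
sizes = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = estimate_exponent(sizes, x_min=1.0)
print(round(alpha_hat, 2))
```

The estimate should land close to the value used to generate the data. A maximum-likelihood fit is used here instead of a straight line through a log-log histogram because the latter is known to give biased exponents.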

More recently, my colleague Lars-Erik Cederman sought to explain Richardson’s findings in his 2003 peer-reviewed publication “Modeling the Size of Wars: From Billiard Balls to Sandpiles.” However, Lars used an invalid statistical technique to test for power law distributions. In 2005, I began collaborating with Professors Neil Johnson and Michael Spagat on related research after I came across their fascinating co-authored study that tested casualty distributions in new wars (internal conflicts) for power laws. Though he was not a co-author on the 2005 study, my colleague Sean Gourley presented this research at TED in 2009.

In any case, I invited Michael to present his research at The Fletcher School in the Fall of 2005 to generate interest here. Shortly after, I suggested to Michael that we test whether conflict events, in addition to casualties, followed a power law distribution. I had access to an otherwise proprietary dataset on conflict events that spanned a longer time period than the casualty datasets that he and Neil were working from. I also suggested we try to test whether casualties from natural disasters follow a power law distribution.

We chose to pursue the latter first and I submitted an abstract to the 2006 American Political Science Association (APSA) conference to present our findings. Soon after, I was accepted to the Santa Fe Institute’s Complex Systems Summer Institute for PhD students and took the opportunity to pursue my original research in testing conflict events for power law distributions with my colleague Dr. Ryan Woodard.

The APSA paper, presented in August 2006, was entitled “Natural Disasters, Casualties and Power Laws: A Comparative Analysis with Armed Conflict” (PDF). Here is the paper’s abstract and findings:

Power-law relationships, relating events with magnitudes to their frequency, are common in natural disasters and violent conflict. Compared to many statistical distributions, power laws drop off more gradually, i.e. they have “fat tails”. Existing studies on natural disaster power laws are mostly confined to physical measurements, e.g., the Richter scale, and seldom cover casualty distributions. Drawing on the Center for Research on the Epidemiology of Disasters (CRED) International Disaster Database, 1980 to 2005, we find strong evidence for power laws in casualty distributions for all disasters combined, both globally and by continent except for North America and non-EU Europe. This finding is timely and gives useful guidance for disaster preparedness and response since natural catastrophes are increasing in frequency and affecting larger numbers of people. We also find that the slopes of the disaster casualty power laws are much smaller than those for modern wars and terrorism, raising an open question of how to explain the differences. We show that many standard risk quantification methods fail in the case of natural disasters.

[Figure: findings from the APSA paper]

Dr. Woodard and I presented our research on power laws and conflict events at SFI in June 2006. We produced a paper in August of that year entitled “Concerning Critical Correlations in Conflict, Cooperation and Casualties” (PDF). As the title implies, we also tested whether cooperative events followed a power law. As far as I know, we were the first to test conflict events, not to mention cooperative events, for power laws. In addition, we looked at conflict/cooperation (C/C) events in Western countries.

The abstract and some findings are included below:

Knowing that the number of casualties of war are distributed as a power law and given a rich data set of conflict and cooperation (C/C) events, we ask: Are there correlations among C/C events? Is there a correlation between C/C events and war casualties? Can C/C data be used as proxy for (potentially) less reliable casualty data? Can C/C data be used in conflict early warning systems? To begin to answer these questions we analyze the distribution of C/C event data for the period 1990–2004 in Afghanistan, Colombia, Iran, Iraq, North Korea, Switzerland, UK and USA. We find that the distributions of individual C/C event types scale as power laws, but only over approximately a single decade, leaving open the possibility of a more appropriate fit (for which we have not yet tested). However, the average exponent of the power law (2.5) is the same as that found in recent studies of casualties of war. We find low levels of correlations between C/C events in Iraq and Afghanistan but not in the other countries studied. We find that the distribution of the sum of all conflict or cooperation events scales exponentially. Finally, we find low levels of correlations between a two year time series of casualties in Afghanistan and the corresponding conflict events.

[Figures: findings from the SFI paper (three images)]

I’m looking forward to discussing all this further with Sinan and learning more about his fascinating area of research.