Tag Archives: migration

Inferring International and Internal Migration Patterns from Twitter

My QCRI colleagues Kiran Garimella and Ingmar Weber recently co-authored an important study on migration patterns discerned from Twitter. The study was co-authored with  Bogdan State (Stanford)  and lead author Emilio Zagheni (CUNY). The authors analyzed 500,000 Twitter users based in OECD countries between May 2011 and April 2013. Since Twitter users are not representative of the OECD population, the study uses a “difference-in-differences” approach to reduce selection bias when in out-migration rates for individual countries. The paper is available here and key insights & results are summarized below.

Twitter Migration

To better understand the demographic characteristics of the Twitter users under study, the authors used face recognition software (Face++) to estimate both the gender and age of users based on their profile pictures. “Face++ uses computer vision and data mining techniques applied to a large database of celebrities to generate estimates of age and sex of individuals from their pictures.” The results are depicted below (click to enlarge). Naturally, there is an important degree of uncertainty about estimates for single individuals. “However, when the data is aggregated, as we did in the population pyramid, the uncertainty is substantially reduced, as overestimates and underestimates of age should cancel each other out.” One important limitation is that age estimates may still be biased if users upload younger pictures of themselves, which would result in underestimating the age of the sample population. This is why other methods to infer age (and gender) should also be applied.

Twitter Migration 3

I’m particularly interested in the bias-correction “difference-in-differences” method used in this study, which demonstrates one can still extract meaningful information about trends even though statistical inferences cannot be inferred since the underlying data does not constitute a representative sample. Applying this method yields the following results (click to enlarge):

Twitter Migration 2

The above graph reveals a number of interesting insights. For example, one can observe a decline in out-migration rates from Mexico to other countries, which is consistent with recent estimates from Pew Research Center. Meanwhile, in Southern Europe, the results show that out-migration flows continue to increase for  countries that were/are hit hard by the economic crisis, like Greece.

The results of this study suggest that such methods can be used to “predict turning points in migration trends, which are particularly relevant for migration forecasting.” In addition, the results indicate that “geolocated Twitter data can substantially improve our understanding of the relationships between internal and international migration.” Furthermore, since the study relies in publicly available, real-time data, this approach could also be used to monitor migration trends on an ongoing basis.

To which extent the above is feasible remains to be seen. Very recent mobility data from official statistics are simply not available to more closely calibrate and validate the study’s results. In any event, this study is an important towards addressing a central question that humanitarian organizations are also asking: how can we make statistical inferences from online data when ground-truth data is unavailable as a reference?

I asked Emilio whether techniques like “difference-in-differences” could be used to monitor forced migration. As he noted, there is typically little to no ground truth data available in humanitarian crises. He thus believes that their approach is potentially relevant to evaluate forced migration. That said, he is quick to caution against making generalizations. Their study focused on OECD countries, which represent relatively large samples and high Internet diffusion, which means low selection bias. In contrast, data samples for humanitarian crises tend to be far smaller and highly selected. This means that filtering out the bias may prove more difficult. I hope that this is a challenge that Emilio and his co-authors choose to take on in the near future.


Using E-Mail Data to Estimate International Migration Rates

As is well known, “estimates of demographic flows are inexistent, outdated, or largely inconsistent, for most countries.” I would add costly to that list as well. So my QCRI colleague Ingmar Weber co-authored a very interesting study on the use of e-mail data to estimate international migration rates.

The study analyzes a large sample of Yahoo! emails sent by 43 million users between September 2009 and June 2011. “For each message, we know the date when it was sent and the geographic location from where it was sent. In addition, we could link the message with the person who sent it, and with the user’s demographic information (date of birth and gender), that was self reported when he or she signed up for a Yahoo! account. We estimated the geographic location from where each email message was sent using the IP address of the user.”

The authors used data on existing migration rates for a dozen countries and international statistics on Internet diffusion rates by age and gender in order to correct for selection bias. For example, “estimated number of migrants, by age group and gender, is multiplied by a correction factor to adjust for over-representation of more educated and mobile people in groups for which the Internet penetration is low.” The graphs below are estimates of age and gender-specific immigration rates for the Philippines. “The gray area represents the size of the bias correction.” This means that “without any correction for bias, the point estimates would be at the upper end of the gray area.” These methods “correct for the fact that the group of users in the sample, although very large, is not representative of the entire population.”

The results? Ingmar and his co-author Emilio Zagheni were able to “estimate migration rates that are consistent with the ones published by those few countries that compile migration statistics. By using the same method for all geographic regions, we obtained country statistics in a consistent way, and we generated new information for those countries that do not have registration systems in place (e.g., developing countries), or that do not collect data on out-migration (e.g., the United States).” Overall, the study documented a “global trend of increasing mobility,” which is “growing at a faster pace for females than males. The rate of increase for different age groups varies across countries.”

The authors argue that this approach could also be used in the context of “natural” disasters and man-made disasters. In terms of future research, they are interested in evaluating “whether sending a high proportion of e-mail messages to a particular country (which is a proxy for having a strong social network in the country) is related to the decision of actually moving to the country.” Naturally, they are also interested in analyzing Twitter data. “In addition to mobility or migration rates, we could evaluate sentiments pro or against migration for different geographic areas. This would help us understand how sentiments change near an international border or in regions with different migration rates and economic conditions.”

I’m very excited to have Ingmar at QCRI so we can explore these ideas further and in the context of humanitarian and development challenges. I’ve been dis-cussing similar research ideas with my colleagues at UN Global Pulse and there may be a real sweet spot for collaboration here, particularly with the recently launched Pulse Lab in Jakarta.” The possibility of collaborating with my collea-gues at Flowminder could also be really interesting given their important study of population movement following the Haiti Earthquake. In conclusion, I fully share the authors’ sentiment when they highlight the fact that it is “more and more important to develop models for data sharing between private com-panies and the academic world, that allow for both protection of users’ privacy & private companies’ interests, as well as reproducibility in scientific publishing.”