
Opening World Bank Data with QCRI’s GeoTagger

My colleagues and I at QCRI partnered with the World Bank several months ago to develop an automated GeoTagger platform to increase the transparency and accountability of international development projects by accelerating the process of opening key development and finance data. We are proud to launch the first version of the GeoTagger platform today. The project builds on the Bank’s Open Data Initiatives promoted by former President Robert Zoellick and continued under the current leadership of Dr. Jim Yong Kim.

QCRI GeoTagger 1

The Bank has accumulated an extensive amount of socio-economic data as well as a massive amount of data on Bank-sponsored development projects worldwide. Much of this data, however, is not directly usable by the general public due to numerous data format, quality and access issues. The Bank therefore launched their “Mapping for Results” initiative to visualize the location of Bank-financed projects in order to better monitor development impact, improve aid effectiveness and coordination, and enhance transparency and social accountability. The geo-tagging of this data, however, has been especially time-consuming and tedious. Numerous interns were required to manually read through tens of thousands of dense World Bank project documents, safeguard documents and results reports to identify and geocode exact project locations. But there are hundreds of thousands of such PDF documents. To make matters worse, these documents make seemingly “random” passing references to project locations, with no sign of any standardized reporting structure whatsoever.

QCRI GeoTagger 2

The purpose of QCRI’s GeoTagger Beta is to automatically “read” through these countless PDF documents to identify and map all references to locations. GeoTagger does this using the World Bank Projects Data API together with the Stanford Named Entity Recognizer (NER) and Alchemy. These tools automatically search through documents and identify place names, which are then geocoded using the Google Geocoder, Yahoo! Placefinder and GeoNames, and placed on a dedicated map. QCRI’s GeoTagger will remain freely available and we’ll be making the code open source as well.
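To make the extract-then-geocode loop concrete, here is a minimal Python sketch. A tiny hard-coded gazetteer stands in for the Stanford NER model and the Google/Yahoo!/GeoNames geocoders; the place names, coordinates and function names are purely illustrative and not part of the actual platform.

```python
import re

# Toy gazetteer standing in for the real geocoding services
# (coordinates are illustrative, not authoritative).
GAZETTEER = {
    "Nairobi": (-1.286, 36.817),
    "Doha": (25.285, 51.531),
    "Jakarta": (-6.208, 106.846),
}

def extract_place_names(text):
    """Crude stand-in for NER: match known gazetteer entries in the text."""
    pattern = r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b"
    return sorted(set(re.findall(pattern, text)))

def geotag(text):
    """Return (place, lat, lon) for every recognized location mention."""
    return [(name, *GAZETTEER[name]) for name in extract_place_names(text)]

doc = ("The water project covers peri-urban Nairobi, with a pilot "
       "component in Jakarta; reporting is handled from Doha.")
for place, lat, lon in geotag(doc):
    print(f"{place}: ({lat}, {lon})")
```

The real pipeline runs a statistical NER model over each PDF's extracted text and resolves each candidate name against several geocoders; the gazetteer lookup above simply makes the two-stage shape of that process visible.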

Naturally, this platform could be customized for many different datasets and organizations, which is why we’ve already been approached by a number of prospective partners to explore other applications. So feel free to get in touch should this also be of interest to your project and/or organization. In the meantime, a very big thank you to my colleagues at QCRI’s Big Data Analytics Center: Dr. Ihab Ilyas, Dr. Shady El-Bassuoni, Mina Farid and last but certainly not least, Ian Ye for their time on this project. Many thanks as well to my colleagues Johannes Kiess, Aleem Walji and team from the World Bank and Stephen Davenport at Development Gateway for the partnership.



Big Data for Development: From Information to Knowledge Societies?

Unlike analog information, “digital information inherently leaves a trace that can be analyzed (in real-time or later on).” But the “crux of the ‘Big Data’ paradigm is actually not the increasingly large amount of data itself, but its analysis for intelligent decision-making (in this sense, the term ‘Big Data Analysis’ would actually be more fitting than the term ‘Big Data’ by itself).” Martin Hilbert describes this as the “natural next step in the evolution from the ‘Information Age’ & ‘Information Societies’ to ‘Knowledge Societies’ [...].”

Hilbert has just published this study on the prospects of Big Data for international development. “From a macro-perspective, it is expected that Big Data informed decision-making will have a similar positive effect on efficiency and productivity as ICT have had during the recent decade.” Hilbert references a 2011 study that concluded the following: “firms that adopted Big Data Analysis have output and productivity that is 5–6% higher than what would be expected given their other investments and information technology usage.” Can these efficiency gains be brought to the unruly world of international development?

To answer this question, Hilbert introduces the above conceptual framework to “systematically review literature and empirical evidence related to the prerequisites, opportunities and threats of Big Data Analysis for international development.” Words, Locations, Nature and Behavior are types of data that are becoming increasingly available in large volumes.

“Analyzing comments, searches or online posts [i.e., Words] can produce nearly the same results for statistical inference as household surveys and polls.” For example, “the simple number of Google searches for the word ‘unemployment’ in the U.S. correlates very closely with actual unemployment data from the Bureau of Labor Statistics.” Hilbert argues that the tremendous volume of free textual data makes “the work and time-intensive need for statistical sampling seem almost obsolete.” But while the “large amount of data makes the sampling error irrelevant, this does not automatically make the sample representative.” 
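The search-term correlation Hilbert cites is straightforward to reproduce in miniature. The sketch below computes a Pearson correlation between two toy monthly series; the numbers are invented for illustration and are not actual Google Trends or Bureau of Labor Statistics figures.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented monthly values: relative search volume for "unemployment"
# and an unemployment rate that moves with it.
searches = [40, 55, 70, 85, 80, 60]
unemployment_rate = [4.5, 5.0, 6.1, 7.0, 6.8, 5.4]

print(f"r = {pearson(searches, unemployment_rate):.2f}")  # → r = 0.99
```

A correlation this strong in toy data is built in by construction, of course; the point of the real studies is that freely available search volumes track official statistics closely enough to serve as a cheap, timely proxy, subject to the representativeness caveat above.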

The increasing availability of Location data (via GPS-enabled mobile phones or RFIDs) needs no further explanation. Nature refers to data on natural processes such as temperature and rainfall. Behavior denotes activities that can be captured through digital means, such as user-behavior in multiplayer online games or economic affairs, for example. But “studying digital traces might not automatically give us insights into offline dynamics. Besides these biases in the source, the data-cleaning process of unstructured Big Data frequently introduces additional subjectivity.”

The availability and analysis of Big Data is obviously limited in areas with scant access to tangible hardware infrastructure. This corresponds to the “Infrastructure” variable in Hilbert’s framework. “Generic Services” refers to the production, adoption and adaptation of software products, since these are a “key ingredient for a thriving Big Data environment.” In addition, the exploitation of Big Data also requires “data-savvy managers and analysts and deep analytical talent, as well as capabilities in machine learning and computer science.” This corresponds to “Capacities and Knowledge Skills” in the framework.

The third and final side of the framework represents the types of policies necessary to actualize the potential of Big Data for international development. These policies are divided into those that elicit Positive Feedback Loops, such as financial incentives, and those that create regulations, such as interoperability requirements, that is, Negative Feedback Loops.

The added value of Big Data Analytics is also dependent on the availability of publicly accessible data, i.e., Open Data. Hilbert estimates that a quarter of US government data could be used for Big Data Analysis if it were made available to the public. There is a clear return on investment in opening up this data. On average, governments with “more than 500 publicly available databases on their open data online portals have 2.5 times the per capita income, and 1.5 times more perceived transparency than their counterparts with less than 500 public databases.” The direction of “causality” here is questionable, however.

Hilbert concludes with a warning. The Big Data paradigm “inevitably creates a new dimension of the digital divide: a divide in the capacity to place the analytic treatment of data at the forefront of informed decision-making. This divide does not only refer to the availability of information, but to intelligent decision-making and therefore to a divide in (data-based) knowledge.” While the advent of Big Data Analysis is certainly not a panacea, “in a world where we desperately need further insights into development dynamics, Big Data Analysis can be an important tool to contribute to our understanding of and improve our contributions to manifold development challenges.”

I am troubled by the study’s assumption that we live in a Newtonian world of decision-making in which for every action there is an automatic equal and opposite reaction. The fact of the matter is that the vast majority of development policies and decisions are not based on empirical evidence. Indeed, rigorous evidence-based policy-making and interventions are still very much the exception rather than the rule in international development. Why? “Accountability is often the unhappy byproduct rather than desirable outcome of innovative analytics. Greater accountability makes people nervous” (Harvard 2013). Moreover, response is always political. But Big Data Analysis runs the risk of depoliticizing a problem. As Alex de Waal noted over 15 years ago, “one universal tendency stands out: technical solutions are promoted at the expense of political ones.” I hinted at this concern when I first blogged about the UN Global Pulse back in 2009.

In sum, James Scott (one of my heroes) puts it best in his latest book:

“Applying scientific laws and quantitative measurement to most social problems would, modernists believed, eliminate the sterile debates once the ‘facts’ were known. [...] There are, on this account, facts (usually numerical) that require no interpretation. Reliance on such facts should reduce the destructive play of narratives, sentiment, prejudices, habits, hyperbole and emotion generally in public life. [...] Both the passions and the interests would be replaced by neutral, technical judgment. [...] This aspiration was seen as a new ‘civilizing project.’ The reformist, cerebral Progressives in early twentieth-century America and, oddly enough, Lenin as well believed that objective scientific knowledge would allow the ‘administration of things’ to largely replace politics. Their gospel of efficiency, technical training and engineering solutions implied a world directed by a trained, rational, and professional managerial elite. [...].”

“Beneath this appearance, of course, cost-benefit analysis is deeply political. Its politics are buried deep in the techniques [...] how to measure it, in what scale to use, [...] in how observations are translated into numerical values, and in how these numerical values are used in decision making. While fending off charges of bias or favoritism, such techniques [...] succeed brilliantly in entrenching a political agenda at the level of procedures and conventions of calculation that is doubly opaque and inaccessible. [...] Charged with bias, the official can claim, with some truth, that ‘I am just cranking the handle’ of a nonpolitical decision-making machine.”

See also:

  • Big Data for Development: Challenges and Opportunities [Link]
  • Beware the Big Errors of Big Data (by Nassim Taleb) [Link]
  • How to Build Resilience Through Big Data [Link]

Why Ushahidi Should Embrace Open Data

“This is the report that Ushahidi did not want you to see.” Or so the rumors in certain circles would have it. Some go as far as suggesting that Ushahidi tried to bury or delay the publication. On the other hand, some rumors claim that the report was a conspiracy to malign and discredit Ushahidi. Either way, what is clear is this: Ushahidi is an NGO that prides itself on promoting transparency & accountability; an organization prepared to take risks—and yes fail—in the pursuit of this mission.

The report in question is CrowdGlobe: Mapping the Maps. A Meta-level Analysis of Ushahidi & Crowdmap. Astute observers will discover that I am indeed one of the co-authors. Published by Internews in collaboration with George Washington University, the report (PDF) reveals that 93% of 12,000+ Crowdmaps analyzed had fewer than 10 reports while a full 61% of Crowdmaps had no reports at all. The rest of the findings are depicted in the infographic below (click to enlarge) and eloquently summarized in the above 5-minute presentation delivered at the 2012 Crisis Mappers Conference (ICCM 2012).
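The headline figures are simple descriptive statistics over per-map report counts. The Python sketch below uses 100 invented counts shaped to reproduce the published proportions; it is not the actual Crowdmap dataset or the analysis code from the report.

```python
# Toy per-map report counts shaped like the published findings:
# 61 empty maps, 32 with only a handful of reports, 7 active ones.
report_counts = [0] * 61 + [3] * 32 + [50] * 7

total = len(report_counts)
share_empty = sum(1 for c in report_counts if c == 0) / total
share_under_10 = sum(1 for c in report_counts if c < 10) / total

print(f"{share_empty:.0%} empty, "
      f"{share_under_10:.0%} with fewer than 10 reports")
```

The same two-line computation, run over a fresh export of per-map report counts, is all it would take to refresh the baseline discussed below.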

[Infographic: CrowdGlobe findings]

Back in 2011, when my colleague Rob Baker (now with Ushahidi) generated the preliminary results of the quantitative analysis that underpins much of the report, we were thrilled to finally have a baseline against which to measure and guide the future progress of Ushahidi & Crowdmap. But when these findings were first publicly shared (August 2012), they were dismissed by critics who argued that the underlying data was obsolete. Indeed, much of the data we used in the analysis dates back to 2010 and 2011. Far from being obsolete, however, this data provides a baseline from which the use of the platform can be measured over time. We are now in 2013 and there are apparently 36,000+ Crowdmaps today rather than just 12,000+.

To this end, and as a member of Ushahidi’s Advisory Board, I have recommended that my Ushahidi colleagues run the same analysis on the most recent Crowdmap data in order to demonstrate the progress made vis-a-vis the now-outdated public baseline. (This analysis should take no more than a few days to carry out.) I also strongly recommend that all this anonymized meta-data be made public on a live dashboard in the spirit of open data and transparency. Ushahidi, after all, is a public NGO funded by some of the biggest proponents of open data and transparency in the world.

Embracing open data is one of the best ways for Ushahidi to dispel the harmful rumors and conspiracy theories that continue to swirl as a result of the CrowdGlobe report. So I hope that my friends at Ushahidi will share their updated analysis and live dashboard in the coming weeks. If they do, then their bold support of this report and commitment to open data will serve as a model for other organizations to emulate. If they’ve just recently resolved to make this a priority, then even better.

In the meantime, I look forward to collaborating with the entire Ushahidi team on making the upcoming Kenyan elections the most transparent to date. As referenced in this blog post, the Standby Volunteer Task Force (SBTF) is partnering with the good people at PyBossa to customize an awesome micro-tasking platform that will significantly facilitate and accelerate the categorization and geo-location of reports submitted to the Ushahidi platform. So I’m working hard with both of these outstanding teams to make this the most successful, large-scale microtasking effort for election monitoring yet. Now let’s hope for everyone’s sake that the elections remain peaceful. Onwards!

Google Inc + World Bank = Empowering Citizen Cartographers?

World Bank Managing Director Caroline Anstey recently announced a new partnership with Google that will apparently empower citizen cartographers in 150 countries worldwide. This has provoked some concern among open source enthusiasts. Under this new agreement, the Bank, UN agencies and developing country governments will be able to “access Google Map Maker’s global mapping platform, allowing the collection, viewing, search and free access to data of geoinformation in over 150 countries and 60 languages.”

So what’s the catch? Google’s licensing agreement for Google Map Maker stipulates the following: Users are not allowed to access Google Map Maker data via any platform other than those designated by Google. Users are not allowed to make any copies of the data, nor can they translate the data, modify it or create a derivative of the data. In addition, users cannot publicly display any Map Maker data for commercial purposes. Finally, users cannot use Map Maker data to create a service that is similar to any already provided by Google.

There’s a saying in the tech world that goes like this: “If the product is free, then you are the product.” I fear this may be the case with the Google-Bank partnership. I worry that Google will organize more crowdsourced mapping projects (like the one they did for Sudan last year), and use people with local knowledge to improve Map Maker data, which will carry all the licensing restrictions described above. Does this really empower citizen cartographers?

Or is this about using citizen cartographers (as free labor?) for commercial purposes? Will Google push Map Maker data to Google Maps & Google Earth products, i.e., expanding market share & commercial interests? Contrast this with the World Bank’s Open Data for Resilience Initiative (OpenDRI), which uses open source software and open data to empower local communities and disaster risk managers. Also, the Google-Bank partnership is specifically with UN agencies and governments, not exactly citizens or NGOs.

Caroline Anstey concludes her announcement with the following:

“In the 17th century, imperial cartographers had an advantage over local communities. They could see the big picture. In the 21st century, the tables have turned: local communities can make the biggest on the ground difference. Crowdsourced citizen cartographers can help make it happen.”

Here’s another version:

“In the 21st century, for-profit companies like Google Inc have an advantage over local communities. They can use big license restrictions. With the Google-Bank partnership, Google can use local communities to collect information for free and make the biggest profit. Crowdsourced citizen cartographers can help make it happen.”

The Google-Bank partnership points to another important issue being ignored in this debate. Let’s not pretend that technology alone determines whether participatory mapping truly empowers local communities. I recently learned of an absolutely disastrous open source “community” mapping project in Africa which should one day be written up in a blog post entitled “Open Source Community Mapping #FAIL”.

So software developers (whether from the open source or proprietary side) who want to get involved in community mapping and have zero experience in participatory GIS, local development and capacity building should think twice: the “do no harm” principle also applies to them. This is equally true of Google Inc. The entire open source mapping community will be watching every move they make on this new World Bank partnership.

I do hope Google eventually realizes just how much of an opportunity they have to do good with this partnership. I am keeping my fingers crossed that they will draft a separate licensing agreement for the World Bank partnership. In fact, I hope they openly invite the participatory GIS and open source mapping communities to co-draft an elevated licensing agreement that will truly empower citizen cartographers. Google would still get publicity—and more importantly positive publicity—as a result. They’d still get the data and have their brand affiliated with said data. But instead of locking up the Map Maker data behind bars and financially profiting from local communities, they’d allow citizens themselves to use the data in whatever platform they so choose to improve citizen feedback in project planning, implementation and monitoring & evaluation. Now wouldn’t that be empowering?