Swift River is an Ushahidi initiative to crowdsource the process of data validation. We’re developing a Swift River pilot to complement the VoteReport India crowdsourcing platform we officially launched this week. As part of the Swift River team, I’d like to share with iRevolution readers what I hope the Swift River tool will achieve.
We had an excellent series of brainstorming sessions several weeks ago in Orlando and decided we would combine both natural language processing (NLP) and decentralized human filtering to get one step closer at validating crowdsourced data. Let me expand on how I see both components working individually and together.
Double-counting has typically been the bane of traditional NLP or automated event-data extraction algorithms. At Virtual Research Associates (VRA), for example, we would parse headlines of Reuters newswires in quasi real-time, which meant that a breaking story would typically be updated throughout the day or week.
But the natural language parser was specifically developed to automate event-data extraction based on the parameters “Who did what, to whom, where and when?” In other words, the parser could not distinguish whether coded events were actually the same or related. This tedious task was left to VRA analysts to carry out.
The logic behind eliminating double counting (duplicate event-data) is inevitably reversed given the nature of crowdsourcing. To be sure, the more reports are collected about a specific event, the more likely it is that the event in question actually took place as described by the crowd. Ironically, that is precisely why we want to “drink from the fire hose,” the swift river of data gushing through the wires of social media networks.
We simply need a clever digital straw to filter the torrent of data. This is where our Swift River project comes in and why I first addressed the issue of double counting. One of the central tasks I’d like Swift River to do is to parse the incoming reports from VoteReport India and to cluster them into unique event-clusters. This would be one way to filter the cascading data. Moreover, the parser could potentially help filter fabricated reports.
For example, if 17 individual reports from different sources are submitted over a two-day period about “forged votes,” then the reports in effect self-triangulate or validate each other. Of course, someone (with too much time on their hands) might decide to send 17 false reports about “forged votes.”
Our digital straw won’t filter all the impurities, but automating this first-level filter is surely better than nothing. Automating this process would require that the digital straw automate the extraction of nouns, verbs and place names from each report, i.e., actor, action and location. Date and time would automatically be coded based on when the report was submitted.
Reports that use similar verbs (synonyms) and refer to the same or similar actors at the same location on the same day can then be clustered into appropriate event-clusters. More on that in the section on crowdsourcing the filter below.
A second-level filter would compare the content of the reports to determine if they were exact replicas. In other words, if someone were simply copying and pasting the same report, Swift River could flag those identical reports as suspicious. This means someone gaming the system would have to send multiple reports with different wording, thus making it a bit more time consuming to game the system.
A third-level filter or trip-wire could compare the source of the 17 reports. For example, perhaps 10 reports were submitted by email, 5 by SMS and two by Twitter. The greater the diversity of media used to report an event, the more likely that event actually happened. This means that someone wanting to game the system would have to send several emails, text messages and Tweets using different language to describe a particular event.
A fourth-level filter could identify the email addresses, IP addresses and mobile phone numbers in question to determine if they too were different. A crook trying to game the system would now have to send emails from different accounts and IP addresses, different mobile phone numbers, and so on. Anything “looking suspicious” would be flagged for a human to review; more on that soon. The point is to make the gaming of the system as time consuming and frustrating as possible.
Gaming the System
Of course, if someone is absolutely bent on submitting fabricated data that passes all the filters, then they will. But those individuals probably constitute a minority of offenders. Perhaps the longer and more often they do this, the more likely someone in the crowd will pick up on the con. As for the less die-hard crooks out there, they may try and game the system only to see that their reports do not get mapped. Hopefully they’ll give up.
I do realize I’m giving away some “secrets” to gaming the system, but I hope this will be more a deterrent than an invitation to crack the system. If you do happen to be someone bent on gaming the platform, I wish you’d get in touch with us instead and help us improve the filters. Either way, we’ll learn from you.
No one on the Swift River team claims that 100% of the dirt will be filtered. What we seek to do is develop a digital filter that makes the data that does come through palatable enough for public consumption.
Crowdsourcing the Filter
Remember the unique event-clusters idea from above? These could be visualized in a simple and intuitive manner for human volunteers (the crowd) to filter. Flag icons, perhaps using three different colors—green, orange and red—could indicate how suspicious a specific series of reports might be based on the results of the individual filters described above.
A green flag would indicate that the report has been automatically mapped on VoteReport upon receipt. An orange flag would indicate the need for review by the crowd while a red flag would send an alert for immediate review.
If a member of the crowd does confirm that a series of reports were indeed fabricated, Swift River would note the associated email address(es), IP address(es) and/or mobile phone number(s) and automatically flag future reports from those sources as red. In other words, Swift River would start rating the credibility of users as well.
If we can pull this off, Swift River may actually start to provide “early warning” signals. To be sure, if we fine tune our unique event-cluster approach, a new event-cluster would be created by a report that describes an event which our parser determines has not yet been reported on.
This should set off a (yellow) flag for immediate review by the crowd. This could either be a legitimate new event or a fabricated report that doesn’t fit into pre-existing cluster. Of course, we will get a number of false positives, but that’s precisely why we include the human crowdsourcing element.
Either way, as the Swift River team has already agreed, this process of crowdsourcing the filter needs to be rendered as simple and seamless as possible. This means minimizing the number of clicks and “mouse motions” a user has to make and allowing for short-cut keys to be used, just like in Gmail. In addition, a userfiendly version of the interface should be designed specifically for mobile phones (various platforms and brands).
As always, I’d love to get your feedback.
Patrick Philippe Meier