Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media

My 20+ page study on verifying crowdsourced information is now publicly available here as a PDF and here as an open Google Doc for comments. I very much welcome constructive feedback from iRevolution readers so I can improve the piece before it gets published in an edited book next year.

Abstract

False information can cost lives. But no information can also cost lives, especially in a crisis zone. Indeed, information is perishable so the potential value of information must be weighed against the urgency of the situation. Correct information that arrives too late is useless. Crowdsourced information can provide rapid situational awareness, especially when added to a live crisis map. But information in the social media space may not be reliable or immediately verifiable. This may explain why humanitarian (and news) organizations are often reluctant to leverage crowdsourced crisis maps. Many believe that verifying crowdsourced information is either too challenging or impossible. The purpose of this paper is to demonstrate that concrete strategies do exist for the verification of geo-referenced crowdsourced social media information. The study first provides a brief introduction to crisis mapping and argues that crowdsourcing is simply non-probability sampling. Next, five case studies comprising various efforts to verify social media are analyzed to demonstrate how different verification strategies work. The five case studies are: Andy Carvin and Twitter; Kyrgyzstan and Skype; BBC’s User-Generated Content Hub; the Standby Volunteer Task Force (SBTF); and U-Shahid in Egypt. The final section concludes the study with specific recommendations.

Update: See also this link and my other posts on Information Forensics.

How to Verify Social Media Content: Some Tips and Tricks on Information Forensics

Update: I have authored a 20+ page paper on verifying social media content based on 5 case studies. Please see this blog post for a copy.

I get this question all the time: “How do you verify social media data?” This question drives many of the conversations on crowdsourcing and crisis mapping these days. It’s high time that we start compiling our tips and tricks into an online how-to guide so that we don’t have to start from square one every time the question comes up. We need to build and accumulate our shared knowledge in information forensics. So here is the Google Doc version of this blog post; please feel free to add your best practices and ask others to contribute. Feel free to also add links to other studies on verifying social media content.

If every source we monitored in the social media space were known and trusted, then the need for verification would not be as pronounced. In other words, it is the plethora and virtual anonymity of sources that makes us skeptical of the content they deliver. Verifying social media data thus requires a two-step process: the authentication of the source as reliable and the triangulation of the content as valid. If we can authenticate the source and find it trustworthy, this may be sufficient to trust the content and mark it as verified, depending on context. If the source is difficult to authenticate, then we need to triangulate the content itself.
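
To make the two-step logic concrete, here is a minimal Python sketch. It is only an illustration: the `Report` fields, the `verify` function and the threshold of two independent confirmations are my own assumptions, not an established standard, and in practice the decision always depends on context.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"


@dataclass
class Report:
    text: str
    # True/False if the source could be authenticated; None if inconclusive.
    source_trusted: Optional[bool]
    # Number of independent sources reporting the same content.
    independent_confirmations: int


def verify(report: Report, min_confirmations: int = 2) -> Status:
    """Hypothetical two-step check: authenticate the source, else triangulate."""
    # Step 1: a trusted, authenticated source may be enough on its own.
    if report.source_trusted:
        return Status.VERIFIED
    # Step 2: otherwise fall back on triangulating the content itself.
    if report.independent_confirmations >= min_confirmations:
        return Status.VERIFIED
    return Status.UNVERIFIED
```

A report from an unauthenticated source with only one confirmation would stay unverified under these assumptions, while the same report with two or more independent confirmations would be marked verified.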

Let’s unpack these two processes, authentication and triangulation, and apply them to Twitter, since the most pressing challenges regarding social media verification have to do with eyewitness, user-generated content. The first step is to try to determine whether the source is trustworthy. Here are some tips on how to do this:

  • Bio on Twitter: Does the source provide a name, picture, bio and any links to their own blog, identity, professional occupation, etc., on their page? If there’s a name, does searching for this name on Google provide any further clues to the person’s identity? Perhaps a Facebook page, a professional email address, a LinkedIn profile?
  • Number of Tweets: Is this a new Twitter handle with only a few tweets? If so, this makes authentication more difficult. Arasmus notes that “the more recent, the less reliable and the more likely it is to be an account intended to spread disinformation.” In general, the longer the Twitter handle has been around and the more Tweets linked to this handle, the better. This gives a digital trace, a history of prior evidence that can be scrutinized for evidence of political bias, misinformation, etc. Arasmus specifies: “What are the tweets like? Does the person qualify his/her reports? Are they intelligible? Is the person given to exaggeration and inconsistencies?”
  • Number of followers: Does the source have a large following? If there are only a few, are any of the followers known and credible sources? Also, how many lists has this Twitter handle been added to?
  • Number following: How many Twitter users does the Twitter handle follow? Are these known and credible sources?
  • Retweets: What type of content does the Twitter handle retweet? Does the Twitter handle in question get retweeted by known and credible sources?
  • Location: Can the source’s geographic location be ascertained? If so, are they near the unfolding events? One way to try to find out by proxy is to examine during which periods of the day/night the source tweets the most. This may provide an indication as to the person’s time zone.
  • Timing: Does the source appear to be tweeting in near real-time? Or are there considerable delays? Does anything appear unusual about the timing of the person’s tweets?
  • Social authentication: If you’re still unsure about the source’s reliability, use your own social network–Twitter, Facebook, LinkedIn–to find out if anyone in your network knows about the source’s reliability.
  • Media authentication: Is the source quoted by trusted media outlets, whether in the mainstream or social media space?
  • Engage the source: Tweet them back and ask them for further information. NPR’s Andy Carvin has employed this technique particularly well. For example, you can tweet back and ask for the source of the report and for any available pictures, videos, etc. Place the burden of proof on the source.
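
The time-zone proxy mentioned under “Location” can be sketched in a few lines of Python. This is a rough illustration under loud assumptions: the function names are hypothetical, the timestamps must be timezone-aware, and the guess that the quietest six-hour stretch begins around local midnight is my own simplification. Shift workers, scheduled tweets and insomniacs will all break it.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_histogram(timestamps):
    """Count tweets per UTC hour of day (0-23). Expects timezone-aware datetimes."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def quietest_window(hist, width=6):
    """Return the UTC start hour of the circular window with the fewest tweets."""
    totals = [
        sum(hist[(start + i) % 24] for i in range(width))
        for start in range(24)
    ]
    # On ties, the earliest start hour wins; treat the result as a hint only.
    return totals.index(min(totals))


def guess_utc_offset(timestamps, assumed_sleep_start=0, width=6):
    """Guess a UTC offset in hours, ASSUMING the quiet window starts
    at `assumed_sleep_start` local time (midnight by default)."""
    start_utc = quietest_window(hourly_histogram(timestamps), width)
    offset = (assumed_sleep_start - start_utc) % 24
    # Map 13..23 to -11..-1 so the result falls between -11 and +12.
    return offset - 24 if offset > 12 else offset
```

A source that claims to be tweeting from one region but whose quiet hours suggest a very different offset is not necessarily lying, but it is a reason to ask more questions.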

These are some of the tips that come to mind for source authentication. For more thoughts on this process, see my previous blog post “Passing the I’m-Not-Gaddafi-Test: Authenticating Identity During Crisis Mapping Operations.” If you have some tips of your own not listed here, please do add them to the Google Doc; they don’t need to be limited to Twitter either.

Now, let’s say that we’ve gone through the list above and find the evidence inconclusive. We thus move on to try to triangulate the content. Here are some tips on how to do this:

  • Triangulation: Are other sources on Twitter or elsewhere reporting on the event you are investigating? As Arasmus notes, “remain skeptical about the reports that you receive. Look for multiple reports from different unconnected sources.” The more independent witnesses you can get information from, the better, and the less critical the need for identity authentication.
  • Origins: If the user reporting an event is not necessarily the original source, can the original source be identified and authenticated? In particular, if the original source is found, does the time/date of the original report make sense given the situation?
  • Social authentication: Ask members of your own social network whether the tweet you are investigating is being reported by other sources. Ask them how unusual the event reporting is to get a sense of how likely it is to have happened in the first place. Andy Carvin’s followers, for example, “help him translate, triangulate, and track down key information. They enable remarkable acts of crowdsourced verification [...] but he must always tell himself to check and challenge what he is told.”
  • Language: Andy Carvin notes that tweets that sound too official, using official language like “breaking news”, “urgent”, “confirmed” etc. need to be scrutinized. “When he sees these terms used, Carvin often replies and asks for additional details, for pictures and video. Or he will quote the tweet and add a simple one word question to the front of the message: Source?” The BBC’s UGC (user-generated content) Hub in London also verifies whether the vocabulary, slang, accents are correct for the location that a source might claim to be reporting from.
  • Pictures: If the Twitter handle shares photographic “evidence”, does the photo provide any clues about the location where it was taken based on buildings, signs, cars, etc., in the background? The BBC’s UGC Hub checks weaponry against those known for the given country and also looks for shadows to determine the possible time of day that a picture was taken. In addition, they examine weather reports to “confirm that the conditions shown fit with the claimed date and time.” These same tips can be applied to Tweets that share video footage.
  • Follow up: If you have contacts in the geographic area of interest, then you could ask them to follow up directly/in-person to confirm the validity of the report. Obviously this is not always possible, particularly in conflict zones. Still, there is increasing anecdotal evidence that this strategy is being used by various media organizations and human rights groups. One particularly striking example comes from Kyrgyzstan, where a Skype group with hundreds of users across the country was able to disprove and counter rumors at a breathtaking pace. See this blog post for more details. See also my blog post on “How to Use Technology to Counter Rumors During Crises: Anecdotes from Kyrgyzstan.”
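
The “Triangulation” and “Origins” tips can be combined into a simple first filter: before counting how many sources report an event, collapse retweets onto their original author so that one report echoed many times is not mistaken for many witnesses. The Python sketch below is illustrative only. The `Tweet` fields, the function names and the threshold of two sources are assumptions of mine, and distinct handles are at best a weak proxy for truly unconnected sources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    author: str
    text: str
    # Handle of the original author if this tweet is a retweet, else None.
    retweet_of: Optional[str] = None


def original_sources(tweets):
    """Collapse retweets onto their original author; return the distinct origins."""
    return {tweet.retweet_of or tweet.author for tweet in tweets}


def is_corroborated(tweets, min_independent=2):
    """Hypothetical rule: at least `min_independent` distinct original sources."""
    return len(original_sources(tweets)) >= min_independent
```

Three tweets that all trace back to the same handle count as one source here, so they would not pass the check until a second, separate origin reports the same event.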

These are just a handful of tips and tricks that come to mind. The number of bullet points above clearly shows we are not completely powerless when verifying social media data. There are several strategies available. The main challenge, as the BBC points out, is that this type of information forensics “can take anything from seconds [...] to hours, as we hunt for clues and confirmation.” See for example my earlier post on “The Crowdsourcing Detective: Crisis, Deception and Intrigue in the Twittersphere” which highlights some challenges but also new opportunities.

One of Storyful’s comparative strengths when it comes to real-time news curation is the growing list of authenticated users it follows. This represents more of a bounded (but certainly not static) approach. As noted in my previous blog post on “Seeking the Trustworthy Tweet,” following a bounded model presents some obvious advantages. This explains why the BBC recommends “maintaining lists of previously verified material [and sources] to act as a reference for colleagues covering the stories.” This strategy is also employed by the Verification Team of the Standby Volunteer Task Force (SBTF).

In sum, I still stand by my earlier blog post entitled “Wag the Dog: How Falsifying Crowdsourced Data can be a Pain.” I also continue to stand by my opinion that some data, even if not immediately verifiable, is better than no data. Also, it’s important to recognize that we have on some occasions seen social media prove to be self-correcting, as I blogged about here. Finally, we know that information is often perishable in times of crisis. By this I mean that crisis data often has a “use-by date” after which it no longer matters whether said information is true or not. So speed is often vital. This is why semi-automated platforms like SwiftRiver that aim to filter and triangulate social media content can be helpful.