Tag Archives: verification

Accelerating the Verification of Social Media Content

Journalists have already been developing a multitude of tactics to verify user-generated content shared on social media. As noted here, the BBC has a dedicated User-Generated Content (UGC) Hub that is tasked with verifying social media information. The UK Guardian, Al-Jazeera, CNN and others are also developing competency in what I refer to as “information forensics”. It turns out there are many tactics that can be used to try to verify social media content. The catch is that applying most of these existing tactics is highly time-consuming.

So building a decision-tree that combines these tactics is the way to go. But doing digital detective work online is still a time-intensive effort. Numerous pieces of digital evidence need to be collected in order to triangulate and ascertain the veracity of just one given report. We therefore need tools that can accelerate the processing of a verification decision-tree. To be sure, information is the most perishable commodity in a crisis—for both journalists and humanitarian professionals. This means that after a certain period of time, it no longer matters whether a report has been verified or not because the news cycle or crisis has unfolded further since.

This is why I’m a fan of tools like Rapportive. The point is to have the decision-tree not only serve as an instruction-set on what types of evidence to collect but to actually have a platform that collects that information. There are two general strategies that could be employed to accelerate and scale the verification process. One is to split the tasks listed in the decision-tree into individual micro-tasks that can be distributed and independently completed using crowdsourcing. A second strategy is to develop automated ways to collect the evidence.
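To make the two strategies concrete, here is a minimal sketch (in Python) of what splitting a verification decision-tree into micro-tasks might look like. The task names, evidence types and data structures are purely illustrative assumptions, not features of any existing platform.

```python
# Illustrative sketch only: the task names and structure are invented.
from dataclasses import dataclass, field


@dataclass
class MicroTask:
    """One independently completable piece of digital detective work."""
    question: str          # what a volunteer (or a bot) is asked to check
    automated: bool        # True if a machine can collect this evidence
    answer: str | None = None


@dataclass
class VerificationCase:
    """All the micro-tasks needed to triangulate a single report."""
    report_id: str
    tasks: list = field(default_factory=list)

    def pending(self, automated):
        return [t for t in self.tasks if t.automated == automated and t.answer is None]


case = VerificationCase(
    report_id="tweet-12345",
    tasks=[
        MicroTask("Does the account predate the event it reports on?", automated=True),
        MicroTask("Does the photo show up in earlier reverse-image search results?", automated=True),
        MicroTask("Do landmarks in the photo match the claimed location?", automated=False),
        MicroTask("Is the wording/dialect consistent with the claimed region?", automated=False),
    ],
)

# Automated tasks go to collectors; the rest are fanned out to the crowd.
print(len(case.pending(automated=True)), "tasks for machines")
print(len(case.pending(automated=False)), "tasks for volunteers")
```

The point of a structure like this is simply that each micro-task can be answered independently, which is what makes both crowdsourcing and automation possible in the first place.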

Of course, both strategies could also be combined. Indeed, some tasks are far better suited for automation while others can only be carried out by humans. In sum, the idea here is to considerably reduce the time it takes journalists and humanitarians to verify user-generated content posted on social media. I am also particularly interested in gamification approaches to solve major challenges, like the protein-folding game Foldit. So if you know of any projects seeking to solve the verification challenge described above in novel ways, I’d be very grateful for your input in the comments section below. Thank you!

Using Rapportive for Source and Information Verification

I’ve been using Rapportive for several weeks now and have found the tool rather useful for assessing the trustworthiness of a source. Rapportive is an extension for Gmail that allows you to automatically visualize an email sender’s complete profile information right inside your inbox.

So now, when receiving emails from strangers, I can immediately see their profile picture, short bio, twitter handle (including latest tweets), links to their Facebook page, Google+ account, LinkedIn profile, blog, SkypeID, recent pictures they’ve posted, etc. As explained in my blog posts on information forensics, this type of meta-data can be particularly useful when assessing the importance or credibility of a source. To be sure, having a source’s entire digital footprint on hand can be quite revealing (as marketers know full well). Moreover, this type of meta-data was exactly what the Standby Volunteer Task Force was manually looking for when they sought to verify the identity of volunteers during the Libya Crisis Map project with the UN last year.

Obviously, the use of Rapportive alone is not a silver bullet to fully determine the credibility of a source or the authenticity of a source’s identity. But it does add contextual information that can make a difference when seeking to better understand the reliability of an email. I’d be curious to know whether Rapportive will be available as a stand-alone platform in the future so it can be used outside of Gmail. A simple web-based search box that allows one to search by email address, twitter handle, etc., with the result being a structured profile of that individual’s entire digital footprint. Anyone know whether similar platforms already exist? They could serve as ideal plugins for platforms like CrisisTracker.
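Until such a stand-alone platform exists, one could imagine stitching something similar together. The sketch below is purely hypothetical: Rapportive exposes no public API, so the lookup functions are placeholders for whatever people-search or social-network services one actually has access to.

```python
# Hypothetical sketch of a stand-alone "digital footprint" lookup.
# The lookup functions are placeholders, not real APIs.

def lookup_twitter(email_or_handle):
    """Placeholder: return basic Twitter profile data, or None."""
    return None  # e.g., call whatever people-search service you have access to


def lookup_linkedin(email):
    """Placeholder: return LinkedIn profile data, or None."""
    return None


def build_footprint(email):
    """Assemble whatever public traces can be found into one structured profile."""
    profile = {
        "email": email,
        "twitter": lookup_twitter(email),
        "linkedin": lookup_linkedin(email),
    }
    # Only keep sources that actually returned something.
    return {k: v for k, v in profile.items() if v is not None}


print(build_footprint("someone@example.org"))
```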

Truth in the Age of Social Media: A Social Computing and Big Data Challenge

I have been writing and blogging about “information forensics” for a while now and thus relished Nieman Report’s must-read study on “Truth in the Age of Social Media.” My applied research has specifically been on the use of social media to support humanitarian crisis response (see the multiple links at the end of this blog post). More specifically, my focus has been on crowdsourcing and automating ways to quantify veracity in the social media space. One of the Research & Development projects I am spearheading at the Qatar Computing Research Institute (QCRI) specifically focuses on this hybrid approach. I plan to blog about this research in the near future but for now wanted to share some of the gems in this superb 72-page Nieman Report.

In the opening piece of the report, Craig Silverman writes that “never before in the history of journalism—or society—have more people and organizations been engaged in fact checking and verification. Never has it been so easy to expose an error, check a fact, crowdsource and bring technology to bear in service of verification.” While social media is new, traditional journalistic skills and values are still highly relevant to verification challenges in the social media space. In fact, some argue that “the business of verifying and debunking content from the public relies far more on journalistic hunches than snazzy technology.”

I disagree. This is not an either/or challenge. Social computing can help everyone, not just journalists, develop and test hunches. Indeed, it is imperative that these tools be in the reach of the general public since a “public with the ability to spot a hoax website, verify a tweet, detect a faked photo, and evaluate sources of information is a more informed public. A public more resistant to untruths and so-called rumor bombs.” This public resistance to untruths can itself be monitored and modeled to quantify veracity, as this study shows.

David Turner from the BBC writes that “while some call this new specialization in journalism ‘information forensics,’ one does not need to be an IT expert or have special equipment to ask and answer the fundamental questions used to judge whether a scene is staged or not.” No doubt, but as Craig rightly points out, “the complexity of verifying content from myriad sources in various mediums and in real time is one of the great new challenges for the profession.” This is fundamentally a Social Computing, Crowd Computing and Big Data problem. Rumors and falsehoods are treated as bugs or patterns of interference rather than as a feature. The key here is to operate at the aggregate level for statistical purposes and to move beyond the notion of true/false as a dichotomy and towards probabilities (think statistical physics). Clustering social media across different media and cross-triangulation using statistical models is one area I find particularly promising.
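As a toy illustration of what cross-triangulation by clustering might look like, the sketch below groups near-identical reports using TF-IDF and cosine similarity. It assumes scikit-learn is installed; the sample reports and the similarity threshold are invented.

```python
# Toy cross-triangulation: cluster near-identical reports across sources.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "Shelling reported near the central market this morning",
    "Several witnesses report shelling near the central market",
    "Power outage across the northern district since last night",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(reports)
similarity = cosine_similarity(vectors)

# Independent reports above the threshold corroborate one another;
# a lone report stays unconfirmed rather than being labeled "false".
THRESHOLD = 0.3
for i, report in enumerate(reports):
    corroborating = [j for j in range(len(reports))
                     if j != i and similarity[i, j] >= THRESHOLD]
    print(f"{report[:40]!r}: {len(corroborating)} corroborating report(s)")
```

The output is deliberately not a true/false label: each report simply accumulates (or fails to accumulate) corroboration, which fits the probabilistic framing above.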

Furthermore, the fundamental questions used to judge whether or not a scene is staged can be codified. “Old values and skills are still at the core of the discipline.” Indeed, heuristics based on decades of rich experience in the field of journalism can be coded into social computing algorithms and big data analytics platforms. This doesn’t mean that a fully automated solution should be the goal. The hunch of the expert, when combined with the wisdom of the crowd and advanced social computing techniques, is far more likely to be effective. As CNN’s Lila King writes, technology may not always be able to “prove if a story is reliable but offers helpful clues.” The quicker we can find those clues, the better.

It is true, as Craig notes, that repressive regimes “create fake videos and images and upload them to YouTube and other websites in the hope that news organizations and the public will find them and take them for real.” It is also true that civil society actors can debunk these falsifications, as I’ve often noted in my research. While the report focuses on social media, we must not forget that offline follow-up and investigation is often an option. During the 2010 Egyptian Parliamentary Elections, civil society groups were able to verify 91% of crowdsourced information in near real time thanks to hyper-local follow up and phone calls. (Incidentally, they worked with a seasoned journalist from Thomson Reuters to design their verification strategies). A similar verification strategy was employed vis-a-vis the atrocities committed in Kyrgyzstan two years ago.

In his chapter on “Detecting Truth in Photos”, Santiago Lyon from the Associated Press (AP) describes the mounting challenges of identifying false or doctored images. “Like other news organizations, we try to verify as best we can that the images portray what they claim to portray. We look for elements that can support authenticity: Does the weather report say that it was sunny at the location that day? Do the shadows fall the right way considering the source of light? Is clothing consistent with what people wear in that region? If we cannot communicate with the videographer or photographer, we will add a disclaimer that says the AP “is unable to independently verify the authenticity, content, location or date of this handout photo/video.”

Santiago and his colleagues are also exploring more automated solutions and believe that “manipulation-detection software will become more sophisticated and useful in the future. This technology, along with robust training and clear guidelines about what is acceptable, will enable media organizations to hold the line against willful image manipulation, thus maintaining their credibility and reputation as purveyors of the truth.”

David Turner’s piece on the BBC’s User-Generated Content (UGC) Hub is also full of gems. “The golden rule, say Hub veterans, is to get on the phone to whoever has posted the material. Even the process of setting up the conversation can speak volumes about the source’s credibility: unless sources are activists living in a dictatorship who must remain anonymous.” This was one of the strategies used by Egyptians during the 2010 Parliamentary Elections. Interestingly, many of the anecdotes that David and Santiago share involve members of the “crowd” letting them know that certain information they’ve posted is in fact wrong. Technology could facilitate this process by distributing the challenge of collective debunking in a far more agile and rapid way using machine learning.

This may explain why David expects the field of “information forensics” to become industrialized. “By that, he means that some procedures are likely to be carried out simultaneously at the click of an icon. He also expects that technological improvements will make the automated checking of photos more effective. Useful online tools for this are Google’s advanced picture search or TinEye, which look for images similar to the photo copied into the search function.” In addition, the BBC’s UGC Hub uses Google Earth to “confirm that the features of the alleged location match the photo.” But these new technologies should not and won’t be limited to verifying content in only one medium but rather across media. Multi-media verification is the way to go.
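One way to approximate the “have we seen this image before?” check locally is perceptual hashing, sketched below. This is only a rough stand-in for what TinEye or Google’s image search actually do; it assumes the Pillow and imagehash packages are installed and that the file paths exist.

```python
# Rough stand-in for a reverse image search: perceptual hashing.
from PIL import Image
import imagehash


def near_duplicate(path_a, path_b, max_distance=8):
    """True if the two images are likely the same picture, even if one has
    been resized, recompressed or lightly edited."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return (hash_a - hash_b) <= max_distance  # Hamming distance between hashes


# Example (paths are hypothetical): compare a newly submitted photo against
# older archive material to catch recycled footage.
# if near_duplicate("submitted.jpg", "archive/2009_flood.jpg"):
#     print("Photo matches older material; possibly recycled.")
```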

Journalists like David Turner often (and rightly) note that “being right is more important than being first.” But in humanitarian crises, information is the most perishable of commodities, and being last vis-a-vis information sharing can actually do harm. Indeed, bad information can have far-reaching negative consequences, but so can no information. This tradeoff must be weighed carefully in the context of verifying crowdsourced crisis information.

Mark Little’s chapter on “Finding the Wisdom in the Crowd” describes the approach that Storyful takes to verification. “At Storyful, we think a combination of automation and human skills provides the broadest solution.” Amen. Mark and his team use the phrase “human algorithm” to describe their approach (I use the term Crowd Computing). In an age when every news event creates a community, “authority has been replaced by authenticity as the currency of social journalism.” Many of Storyful’s tactics for vetting authenticity are the same ones we use in crisis mapping when we seek to validate crowdsourced crisis information. These combine the common sense of an investigative journalist with advanced digital literacy.

In her chapter, “Taking on the Rumor Mill,” Katherine Lee writes that a “disaster is ready-made for social media tools, which provide the immediacy needed for reporting breaking news.” She describes the use of these tools during and after the tornado that hit Alabama in April 2011. What I found particularly interesting was her news team’s decision to blog “to probe some of the more persistent rumors, tracking where they might have originated and talking with officials to get the facts. The format fit the nature of the story well. Tracking the rumors, with their ever-changing details, in print would have been slow and awkward, and the blog allowed us to update quickly.” In addition, the blog format “gave readers a space to weigh in with their own evidence, which proved very useful.”

The remaining chapters in the Nieman Report are equally interesting but do not focus on “information forensics” per se. I look forward to sharing more on QCRI’s project on quantifying veracity in the near future. Our objective is to learn from experts such as those cited above, codify their experience, and leverage the latest breakthroughs in social computing and big data analytics to facilitate the verification and validation of crowdsourced social media content. It is worth emphasizing that these codified heuristics cannot and must not remain static, nor can the underlying algorithms become hardwired. More on this in a future post. In the meantime, the following links may be of interest:

  • Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media (Link)
  • How to Verify and Counter Rumors in Social Media (Link)
  • Data Mining to Verify Crowdsourced Information in Syria (Link)
  • Analyzing the Veracity of Tweets During a Crisis (Link)
  • Crowdsourcing for Human Rights: Challenges and Opportunities for Information Collection & Verification (Link)
  • Truthiness as Probability: Moving Beyond the True or False Dichotomy when Verifying Social Media (Link)
  • The Crowdsourcing Detective: Crisis, Deception and Intrigue in the Twittersphere (Link)
  • Crowdsourcing Versus Putin (Link)
  • Wiki on Truthiness resources (Link)
  • My TEDx Talk: From Photosynth to ALLsynth (Link)
  • Social Media and Life Cycle of Rumors during Crises (Link)
  • Wag the Dog, or How Falsifying Crowdsourced Data Can Be a Pain (Link)

Crowdsourcing for Human Rights Monitoring: Challenges and Opportunities for Information Collection & Verification

This new book, Human Rights and Information Communication Technologies: Trends and Consequences of Use, promises to be a valuable resource to both practitioners and academics interested in leveraging new information & communication technologies (ICTs) in the context of human rights work. I had the distinct pleasure of co-authoring a chapter for this book with my good colleague and friend Jessica Heinzelman. We focused specifically on the use of crowdsourcing and ICTs for information collection and verification. Below is the Abstract & Introduction for our chapter.

Abstract

Accurate information is a foundational element of human rights work. Collecting and presenting factual evidence of violations is critical to the success of advocacy activities and the reputation of organizations reporting on abuses. To ensure credibility, human rights monitoring has historically been conducted through highly controlled organizational structures that face mounting challenges in terms of capacity, cost and access. The proliferation of Information and Communication Technologies (ICTs) provides new opportunities to overcome some of these challenges through crowdsourcing. At the same time, however, crowdsourcing raises new challenges of verification and information overload that have made human rights professionals skeptical of its utility. This chapter explores whether the efficiencies gained through an open call for monitoring and reporting abuses provide a net gain for human rights monitoring and analyzes the opportunities and challenges that new and traditional methods pose for verifying crowdsourced human rights reporting.

Introduction

Accurate information is a foundational element of human rights work. Collecting and presenting factual evidence of violations is critical to the success of advocacy activities and the reputation of organizations reporting on abuses. To ensure credibility, human rights monitoring has historically been conducted through highly controlled organizational structures that face mounting challenges in terms of capacity, cost and access.

The proliferation of Information and Communication Technologies (ICTs) may provide new opportunities to overcome some of these challenges. For example, ICTs make it easier to engage large networks of unofficial volunteer monitors to crowdsource the monitoring of human rights abuses. Jeff Howe coined the term “crowdsourcing” in 2006, defining it as “the act of taking a job traditionally performed by a designated agent and outsourcing it to an undefined, generally large group of people in the form of an open call” (Howe, 2009). Applying this concept to human rights monitoring, Molly Land (2009) asserts that, “given the limited resources available to fund human rights advocacy…amateur involvement in human rights activities has the potential to have a significant impact on the field” (p. 2). That said, she warns that professionalization in human rights monitoring “has arisen not because of an inherent desire to control the process, but rather as a practical response to the demands of reporting – namely, the need to ensure the accuracy of the information contained in the report” (Land, 2009, p. 3).

Because “accuracy is the human rights monitor’s ultimate weapon” and the advocate’s “ability to influence governments and public opinion is based on the accuracy of their information,” the risk of inaccurate information may trump any advantages gained through crowdsourcing (Codesria & Amnesty International, 2000, p. 32). To this end, the question facing human rights organizations that wish to leverage the power of the crowd is “whether [crowdsourced reports] can accomplish the same [accurate] result without a centralized hierarchy” (Land, 2009). The answer to this question depends on whether reliable verification techniques exist so organizations can use crowdsourced information in a way that does not jeopardize their credibility or compromise established standards. While many human rights practitioners (and indeed humanitarians) still seem to be allergic to the term crowdsourcing, further investigation reveals that established human rights organizations already use crowdsourcing and verification techniques to validate crowdsourced information and that there is great potential in the field for new methods of information collection and verification.

This chapter analyzes the opportunities and challenges that new and traditional methods pose for verifying crowdsourced human rights reporting. The first section reviews current methods for verification in human rights monitoring. The second section outlines existing methods used to collect and validate crowdsourced human rights information. Section three explores the practical opportunities that crowdsourcing offers relative to traditional methods. The fourth section outlines critiques and solutions for crowdsourcing reliable information. The final section proposes areas for future research.

The book is available for purchase here. Warning: you won’t like the price but at least they’re taking an iTunes approach, allowing readers to purchase single chapters if they prefer. Either way, Jess and I were not paid for our contribution.

For more information on how to verify crowdsourced information, please visit the following links:

  • Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media (Link)
  • How to Verify and Counter Rumors in Social Media (Link)
  • Social Media and Life Cycle of Rumors during Crises (Link)
  • Truthiness as Probability: Moving Beyond the True or False Dichotomy when Verifying Social Media (Link)
  • Crowdsourcing Versus Putin (Link)

Crisis Mapping Syria: Automated Data Mining and Crowdsourced Human Intelligence

The Syria Tracker Crisis Map is without doubt one of the most impressive crisis mapping projects yet. Launched just a few weeks after the protests began one year ago, the crisis map is spearheaded by just a handful of US-based Syrian activists who have meticulously and systematically documented 1,529 reports of human rights violations including a total of 11,147 killings. As recently reported in this NewScientist article, “Mapping the Human Cost of Syria’s Uprising,” the crisis map “could be the most accurate estimate yet of the death toll in Syria’s uprising […].” Their approach? “A combination of automated data mining and crowdsourced human intelligence,” which “could provide a powerful means to assess the human cost of wars and disasters.”

On the data-mining side, Syria Tracker has repurposed the HealthMap platform, which mines thousands of online sources for the purposes of disease detection and then maps the results, “giving public-health officials an easy way to monitor local disease conditions.” The customized version of this platform for Syria Tracker (ST), known as HealthMap Crisis, mines English information sources for evidence of human rights violations, such as killings, torture and detainment. As the ST Team notes, their data mining platform “draws from a broad range of sources to reduce reporting biases.” Between June 2011 and January 2012, for example, the platform collected over 43,000 news articles and blog posts from almost 2,000 English-based sources from around the world (including some pro-regime sources).
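HealthMap’s internals are not public, so the following is only a naive illustration of the underlying idea: scan incoming English-language articles for violation-related keywords and tag them by category. The categories and terms are invented, and a real system would rely on far richer classifiers.

```python
# Naive illustration only: tag incoming articles by violation-related keywords.
import re

VIOLATION_TERMS = {
    "killing": ["killed", "shot dead", "massacre"],
    "detention": ["detained", "arrested", "imprisoned"],
    "torture": ["tortured", "beaten"],
}


def tag_article(text):
    """Return the set of violation categories an article appears to mention."""
    text = text.lower()
    tags = set()
    for category, terms in VIOLATION_TERMS.items():
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms):
            tags.add(category)
    return tags


article = "Activists say three protesters were shot dead and dozens detained."
print(tag_article(article))  # {'killing', 'detention'}
```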

Syria Tracker combines the results of this sophisticated data mining approach with crowdsourced human intelligence, i.e., field-based eye-witness reports shared via webform, email, Twitter, Facebook, YouTube and voicemail. This naturally presents several important security issues, which explains why the main ST website includes an instructions page detailing security precautions that need to be taken while submitting reports from within Syria. They also link to this practical guide on how to protect your identity and security online and when using mobile phones. The guide is available in both English and Arabic.

Eye-witness reports are subsequently translated, geo-referenced, coded and verified by a group of volunteers who triangulate the information with other sources such as those provided by the HealthMap Crisis platform. They also filter the reports and remove duplicates. Reports that have a low confidence level vis-a-vis veracity are also removed. Volunteers use a vote-up/vote-down feature to “score” the veracity of eye-witness reports. Using this approach, the ST Team and their volunteers have been able to verify almost 90% of the documented killings mapped on their platform thanks to video and/or photographic evidence. They have also been able to associate specific names with about 88% of those reported killed by Syrian forces since the uprising began.
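Syria Tracker has not published its internal scoring rules, so the sketch below is a generic illustration of the two steps just described: collapsing duplicate reports and dropping those whose vote-based confidence stays too low. The field names, values and thresholds are invented.

```python
# Generic illustration only: deduplicate reports, then keep those whose
# crude vote-based confidence clears a threshold.
reports = [
    {"id": 1, "location": "Homs", "date": "2012-02-03", "up": 5, "down": 0, "has_media": True},
    {"id": 2, "location": "Homs", "date": "2012-02-03", "up": 2, "down": 0, "has_media": True},
    {"id": 3, "location": "Hama", "date": "2012-02-04", "up": 1, "down": 4, "has_media": False},
]

# 1. Deduplicate: keep the best-supported report per (location, date) key.
unique = {}
for r in reports:
    key = (r["location"], r["date"])
    if key not in unique or r["up"] > unique[key]["up"]:
        unique[key] = r


# 2. Filter: confidence from votes, boosted when photo/video evidence exists.
def confidence(r):
    votes = r["up"] + r["down"]
    base = r["up"] / votes if votes else 0.5
    return min(1.0, base + (0.2 if r["has_media"] else 0.0))


kept = [r["id"] for r in unique.values() if confidence(r) >= 0.6]
print(kept)  # only the well-supported, non-duplicate reports remain
```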

Depending on the levels of violence in Syria, the turn-around time for a report to be mapped on Syria Tracker is between 1-3 days. The team also produces weekly situation reports based on the data they’ve collected along with detailed graphical analysis. KML files that can be uploaded and viewed using Google Earth are also made available on a regular basis. These provide “a more precisely geo-located tally of deaths per location.”

In sum, Syria Tracker is very much breaking new ground vis-a-vis crisis mapping. They’re combining automated data mining technology with crowdsourced eye-witness reports from Syria. In addition, they’ve been doing this for a year, which makes the project the longest-running crisis map I’ve seen in a hostile environment. Moreover, they’ve been able to sustain these important efforts with just a small team of volunteers. As for the veracity of the collected information, I know of no other public effort that has taken such a meticulous and rigorous approach to documenting the killings in Syria in near real-time. On February 24th, Al-Jazeera posted the following estimates:

Syrian Revolution Coordination Union: 9,073 deaths
Local Coordination Committees: 8,551 deaths
Syrian Observatory for Human Rights: 5,581 deaths

At the time, Syria Tracker had a total of 7,901 documented killings associated with specific names, dates and locations. While some duplicate reports may remain, the team argues that “missing records are a much bigger source of error.” Indeed, they believe that “the higher estimates are more likely, even if one chooses to disregard those reports that came in on some of the most violent days where names were not always recorded.”

The Syria Crisis Map itself has been viewed by visitors from 136 countries around the world and 2,018 cities—with the top 3 cities being Damascus, Washington DC and, interestingly, Riyadh, Saudi Arabia. The witnessing has thus been truly global and collective. When the Syrian regime falls, “the data may help subsequent governments hold [Assad] and other senior leaders to account,” writes the New Scientist. This was one of the principal motivations behind the launch of the Ushahidi platform in Kenya over four years ago. Syria Tracker is powered by Ushahidi’s cloud-based platform, Crowdmap. Finally, we know for a fact that the International Criminal Court (ICC) and Amnesty International (AI) closely followed the Libya Crisis Map last year.

Truthiness as Probability: Moving Beyond the True or False Dichotomy when Verifying Social Media

I asked the following question at the Berkman Center’s recent Symposium on Truthiness in Digital Media: “Should we think of truthiness in terms of probabilities rather than use a True or False dichotomy?” The wording here is important. The word “truthiness” already suggests a subjective fuzziness around the term. Expressing truthiness as probabilities provides more contextual information than does a binary true or false answer.

When we set out to design the SwiftRiver platform some three years ago, it was already clear to me then that the veracity of crowdsourced information ought to be scored in terms of probabilities. For example, what is the probability that the content of a Tweet referring to the Russian elections is actually true? Why use probabilities? Because it is particularly challenging to instantaneously verify crowdsourced information in the real-time social media world we live in.

There is a common tendency to assume that all unverified information is false until proven otherwise. This is too simplistic, however. We need a fuzzy logic approach to truthiness:

“In contrast with traditional logic theory, where binary sets have two-valued logic: true or false, fuzzy logic variables may have a truth value that ranges in degree between 0 and 1. Fuzzy logic has been extended to handle the concept of partial truth, where the truth value may range between completely true and completely false.”

The majority of user-generated content is unverified at time of birth. (Does said data deserve the “original sin” of being labeled as false, unworthy, until proven otherwise? To digress further, unverified content could be said to have a distinct wave function that enables said data to be both true and false until observed. The act of observation starts the collapse of said wave function. To the astute observer, yes, I’m riffing off Schrödinger’s Cat, and was also pondering how to weave in Heisenberg’s uncertainty principle as an analogy; think of a piece of information characterized by a “probability cloud” of truthiness).

I believe the hard sciences have much to offer in this respect. Why don’t we have error margins for truthiness? Why not take a weather forecast approach to information truthiness in social media? What if we had a truthiness forecast, understanding full well that weather forecasts are not always correct? The fact that a 70% chance of rain is forecast doesn’t prevent us from acting and using that forecast to inform our decision-making. If we applied binary logic to weather forecasts, we’d be left with either a 100% chance of rain or a 100% chance of sun. Such forecasts would be suspect at best and frequently wrong.

In any case, instead of dismissing content generated in real-time because it is not immediately verifiable, we can draw on Information Forensics to begin assessing the potential validity of said content. Tactics from information forensics can help us create a score card of heuristics to express truthiness in terms of probabilities. (I call this advanced media literacy). There are indeed several factors that one can weigh, e.g., the identity of the messenger relaying the content, the source of the content, the wording of said content, the time of day the information was shared, the geographical proximity of the source to the event being reported, etc.
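Here is a minimal score-card sketch along those lines. The factor names, weights and threshold are invented for illustration; in practice they would be distilled from experience in information forensics.

```python
# Illustrative score card: invented factors and weights, not a real model.
HEURISTICS = {
    # factor -> weight (how much it moves the truthiness estimate)
    "source_has_history":          0.25,  # messenger has a verifiable track record
    "content_corroborated":        0.30,  # independent sources report the same thing
    "wording_is_firsthand":        0.15,  # "I am seeing..." rather than "I heard..."
    "posted_near_event_time":      0.10,
    "source_geographically_close": 0.20,
}


def truthiness(signals):
    """Combine binary signals into a probability-like score between 0 and 1."""
    score = sum(weight for factor, weight in HEURISTICS.items() if signals.get(factor))
    return round(score, 2)


tweet_signals = {
    "source_has_history": True,
    "content_corroborated": False,
    "wording_is_firsthand": True,
    "posted_near_event_time": True,
    "source_geographically_close": False,
}
print(truthiness(tweet_signals))  # 0.5 -- "uncertain", not simply true or false
```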

These weights need not be static as they are largely subjective and temporal; after all, truth is socially constructed and dynamic. So while a “wisdom of the crowds” approach alone may not always be well-suited to generating these weights, perhaps integrating the hunch of the expert with machine learning algorithms (based on lessons learned in information forensics) could result in more useful decision-support tools for truthiness forecasting (or rather “backcasting”).
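One plausible way to make such weights dynamic is to learn them from past reports whose truth value eventually became known, for example with a simple logistic regression. The toy data below is invented and scikit-learn is assumed to be installed; it is a sketch of the idea, not of any existing system.

```python
# Toy example: learn heuristic weights from reports later verified or debunked.
from sklearn.linear_model import LogisticRegression

# Each row: [has_history, corroborated, firsthand, near_event_time, geo_close]
X = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
y = [1, 1, 0, 0, 1, 0]  # 1 = later verified true, 0 = later debunked

model = LogisticRegression().fit(X, y)

new_report = [[1, 0, 1, 1, 0]]
print(model.predict_proba(new_report)[0][1])  # probability the report is true
```

Retraining such a model as new ground truth arrives is one concrete way to keep the weights from becoming hardwired.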

In sum, thinking of truthiness strictly in terms of true and false prevents us from “complexifying” a scalar variable into a vector (a wave function), which in turn limits our ability to develop new intervention strategies. We need new conceptual frameworks to reflect the complexity and ambiguity of user-generated content.


Trails of Trustworthiness in Real-Time Streams

Real-time information channels like Twitter, Facebook and Google have created cascades of information that are becoming increasingly challenging to navigate. “Smart-filters” alone are not the solution since they won’t necessarily help us determine the quality and trustworthiness of the information we receive. I’ve been studying this challenge ever since the idea behind SwiftRiver first emerged several years ago now.

I was thus thrilled to come across a short paper on “Trails of Trustworthiness in Real-Time Streams” which describes a start-up project that aims to provide users with a “system that can maintain trails of trustworthiness propagated through real-time information channels,” which will “enable its educated users to evaluate its provenance, its credibility and the independence of the multiple sources that may provide this information.” The authors, Panagiotis Metaxas and Eni Mustafaraj, kindly cite my paper on “Information Forensics” and also reference SwiftRiver in their conclusion.

The paper argues that studying the tactics that propagandists employ in real life can provide insights and even predict the tricks employed by Web spammers.

“To prove the strength of this relationship between propagandistic and spamming techniques, […] we show that one can, in fact, use anti-propagandistic techniques to discover Web spamming networks. In particular, we demonstrate that when starting from an initial untrustworthy site, backwards propagation of distrust (looking at the graph defined by links pointing to an untrustworthy site) is a successful approach to finding clusters of spamming, untrustworthy sites. This approach was inspired by the social behavior associated with distrust: in society, recognition of an untrustworthy entity (person, institution, idea, etc) is reason to question the trustworthiness of those who recommend it. Other entities that are found to strongly support untrustworthy entities become less trustworthy themselves. As in society, distrust is also propagated backwards on the Web graph.”
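A minimal sketch of that idea, under my own simplifying assumptions: start from a known untrustworthy site, repeatedly ask “who links to this?”, and flag what you find. The link graph below is invented, and the authors’ actual algorithm is more involved.

```python
# Toy "backwards propagation of distrust" over an invented link graph.
from collections import deque

# links[page] = set of pages that this page links to
links = {
    "spam-hub.example":    {"seed-untrustworthy.example"},
    "mirror1.example":     {"spam-hub.example", "seed-untrustworthy.example"},
    "honest-news.example": {"reuters.com"},
}


def linkers_to(target):
    """All pages that link to the target page."""
    return {page for page, outlinks in links.items() if target in outlinks}


def propagate_distrust(seed, max_depth=2):
    """Collect pages reachable by repeatedly asking 'who links to this?'."""
    suspect, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for linker in linkers_to(page):
            if linker not in suspect:
                suspect.add(linker)
                frontier.append((linker, depth + 1))
    return suspect - {seed}


print(propagate_distrust("seed-untrustworthy.example"))
# {'spam-hub.example', 'mirror1.example'} -- a candidate spamming cluster
```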

The authors document that today’s Web spammers are using increasingly sophisticated tricks.

“In cases where there are high stakes, Web spammers’ influence may have important consequences for a whole country. For example, in the 2006 Congressional elections, activists using Google bombs orchestrated an effort to game search engines so that they present information in the search results that was unfavorable to 50 targeted candidates. While this was an operation conducted in the open, spammers prefer to work in secrecy so that their actions are not revealed. So, we revealed and documented the first Twitter bomb, which tried to influence the Massachusetts special elections, showing how an Iowa-based political group, hiding its affiliation and profile, was able to serve misinformation a day before the election to more than 60,000 Twitter users that were following the elections. Very recently we saw an increase in political cybersquatting, a phenomenon we reported in [28]. And even more recently, […] we discovered the existence of Pre-fabricated Twitter factories, an effort to provide collaborators pre-compiled tweets that will attack members of the Media while avoiding detection of automatic spam algorithms from Twitter.”

The theoretical foundations for a trustworthiness system:

“Our concept of trustworthiness comes from the epistemology of knowledge. When we believe that some piece of information is trustworthy (e.g., true, or mostly true), we do so for intrinsic and/or extrinsic reasons. Intrinsic reasons are those that we acknowledge because they agree with our own prior experience or belief. Extrinsic reasons are those that we accept because we trust the conveyor of the information. If we have limited information about the conveyor of information, we look for a combination of independent sources that may support the information we receive (e.g., we employ “triangulation” of the information paths). In the design of our system we aim to automatize as much as possible the process of determining the reasons that support the information we receive.”

“We define as trustworthy, information that is deemed reliable enough (i.e., with some probability) to justify action by the receiver in the future. In other words, trustworthiness is observable through actions.”

“The overall trustworthiness of the information we receive is determined by a linear combination of (a) the reputation R_Z of the original sender Z, (b) the credibility we associate with the contents of the message itself, C(m), and (c) characteristics of the path that the message used to reach us.”
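To make the formula concrete, here is a small sketch of that linear combination. The paper gives the form but not the coefficients, so the weights and example values below are placeholders of my own.

```python
# Placeholder weights: the paper specifies the form, not the coefficients.
def trustworthiness(sender_reputation, message_credibility, path_factor,
                    w_sender=0.4, w_content=0.4, w_path=0.2):
    """All inputs are assumed to be normalized to the [0, 1] range."""
    return (w_sender * sender_reputation
            + w_content * message_credibility
            + w_path * path_factor)


# A retweet of a well-known reporter, with plausible content, via one hop:
print(trustworthiness(sender_reputation=0.9, message_credibility=0.7, path_factor=0.8))
```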

“To compute the trustworthiness of each message from scratch is clearly a huge task. But the research that has been done so far justifies optimism in creating a semi-automatic, personalized tool that will help its users make sense of the information they receive. Clearly, no such system exists right now, but components of our system do exist in some of the popular [real-time information channels]. For a testing and evaluation of our system we plan to use primarily Twitter, but also real-time Google results and Facebook.”

In order to provide trails of trustworthiness in real-time streams, the authors plan to address the following challenges:

•  “Establishment of new metrics that will help evaluate the trustworthiness of information people receive, especially from real-time sources, which may demand immediate attention and action. […] we show that coverage of a wider range of opinions, along with independence of results’ provenance, can enhance the quality of organic search results. We plan to extend this work in the area of real-time information so that it does not rely on post-processing procedures that evaluate quality, but on real-time algorithms that maintain a trail of trustworthiness for every piece of information the user receives.”

• “Monitor the evolving ways in which information reaches users, in particular citizens near election time.”

•  “Establish a personalizable model that captures the parameters involved in the determination of trustworthiness of information in real-time information channels, such as Twitter, extending the work of measuring quality in more static information channels, and by applying machine learning and data mining algorithms. To implement this task, we will design online algorithms that support the determination of quality via the maintenance of trails of trustworthiness that each piece of information carries with it, either explicitly or implicitly. Of particular importance, is that these algorithms should help maintain privacy for the user’s trusting network.”

• “Design algorithms that can detect attacks on [real-time information channels]. For example we can automatically detect bursts of activity related to a subject, source, or non-independent sources. We have already made progress in this area. Recently, we advised and provided data to a group of researchers at Indiana University to help them implement “truthy”, a site that monitors bursty activity on Twitter. We plan to advance, fine-tune and automate this process. In particular, we will develop algorithms that calculate the trust in an information trail based on a score that is affected by the influence and trustworthiness of the informants.” (A toy sketch of this kind of burst detection follows below.)
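Here is a toy version of that kind of burst detection: count messages per time window for a given subject and flag windows that sit far above the recent average. The window size, threshold and counts are invented; real systems such as Truthy are considerably more sophisticated.

```python
# Toy burst detector: flag time windows whose message count is far above average.
from statistics import mean, stdev


def bursty_windows(counts, z_threshold=2.0):
    """Return indices of windows whose count is a statistical outlier."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma >= z_threshold]


# Tweets per 10-minute window mentioning one candidate (invented numbers):
counts = [12, 9, 14, 11, 10, 13, 96, 8, 12]
print(bursty_windows(counts))  # [6] -- the sudden spike worth inspecting
```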

In conclusion, the authors “mention that in a month from this writing, Ushahidi […] plans to release SwiftRiver, a platform that ‘enables the filtering and verification of real-time data from channels like Twitter, SMS, Email and RSS feeds’. Several of the features of Swift River seem similar to what we propose, though a major difference appears to be that our design is personalization at the individual user level.”

Indeed, having been involved in SwiftRiver research since early 2009 and currently testing the private beta, I can confirm there are important similarities and some differences. Personalization, however, is not one of the differences: Swift allows full personalization at the individual user level.

Another is that we’re hoping to go beyond just text-based information with Swift, i.e., we hope to pull in pictures and video footage (in addition to Tweets, RSS feeds, email, SMS, etc) in order to cross-validate information across media, which we expect will make the falsification of crowdsourced information more challenging, as I argue here. In any case, I very much hope that the system being developed by the authors will be free and open source so that integration might be possible.

A copy of the paper is available here (PDF). I hope to meet the authors at the Berkman Center’s “Truth in Digital Media Symposium” and highly recommend the wiki they’ve put together with additional resources. I’ve added the majority of my research on verification of crowdsourced information to that wiki, such as my 20-page study on “Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media.”