Real-time information channels like Twitter, Facebook and Google have created cascades of information that are becoming increasingly challenging to navigate. “Smart-filters” alone are not the solution since they won’t necessarily help us determine the quality and trustworthiness of the information we receive. I’ve been studying this challenge ever since the idea behind SwiftRiver first emerged several years ago now.
I was thus thrilled to come across a short paper on “Trails of Trustworthiness in Real-Time Streams” which describes a start-up project that aims to provide users with a “system that can maintain trails of trustworthiness propagated through real-time information channels,” which will “enable its educated users to evaluate its provenance, its credibility and the independence of the multiple sources that may provide this information.” The authors, Panagiotis Metaxas and Eni Mustafaraj, kindly cite my paper on “Information Forensics” and also reference SwiftRiver in their conclusion.
The paper argues that studying the tactics that propagandists employ in real life can provide insights and even predict the tricks employed by Web spammers.
“To prove the strength of this relationship between propagandistic and spamming techniques, […] we show that one can, in fact, use anti-propagandistic techniques to discover Web spamming networks. In particular, we demonstrate that when starting from an initial untrustworthy site, backwards propagation of distrust (looking at the graph defined by links pointing to to an untrustworthy site) is a successful approach to finding clusters of spamming, untrustworthy sites. This approach was inspired by the social behavior associated with distrust: in society, recognition of an untrustworthy entity (person, institution, idea, etc) is reason to question the trust- worthiness of those who recommend it. Other entities that are found to strongly support untrustworthy entities become less trustworthy themselves. As in society, distrust is also propagated backwards on the Web graph.”
The authors document that today’s Web spammers are using increasingly sophisticated tricks.
“In cases where there are high stakes, Web spammers’ influence may have important consequences for a whole country. For example, in the 2006 Congressional elections, activists using Google bombs orchestrated an effort to game search engines so that they present information in the search results that was unfavorable to 50 targeted candidates. While this was an operation conducted in the open, spammers prefer to work in secrecy so that their actions are not revealed. So, revealed and documented the first Twitter bomb, which tried to influence the Massachusetts special elections, show- ing how an Iowa-based political group, hiding its affiliation and profile, was able to serve misinformation a day before the election to more than 60,000 Twitter users that were follow- ing the elections. Very recently we saw an increase in political cybersquatting, a phenomenon we reported in . And even more recently, […] we discovered the existence of Pre-fabricated Twitter factories, an effort to provide collaborators pre-compiled tweets that will attack members of the Media while avoiding detection of automatic spam algorithms from Twitter.
The theoretical foundations for a trustworthiness system:
“Our concept of trustworthiness comes from the epistemology of knowledge. When we believe that some piece of information is trustworthy (e.g., true, or mostly true), we do so for intrinsic and/or extrinsic reasons. Intrinsic reasons are those that we acknowledge because they agree with our own prior experience or belief. Extrinsic reasons are those that we accept because we trust the conveyor of the information. If we have limited information about the conveyor of information, we look for a combination of independent sources that may support the information we receive (e.g., we employ “triangulation” of the information paths). In the design of our system we aim to automatize as much as possible the process of determining the reasons that support the information we receive.”
“We define as trustworthy, information that is deemed reliable enough (i.e., with some probability) to justify action by the receiver in the future. In other words, trustworthiness is observable through actions.”
“The overall trustworthiness of the information we receive is determined by a linear combination of (a) the reputation RZ of the original sender Z, (b) the credibility we associate with the contents of the message itself C(m), and (c) characteristics of the path that the message used to reach us.”
“To compute the trustworthiness of each message from scratch is clearly a huge task. But the research that has been done so far justifies optimism in creating a semi-automatic, personalized tool that will help its users make sense of the information they receive. Clearly, no such system exists right now, but components of our system do exist in some of the popular [real-time information channels]. For a testing and evaluation of our system we plan to use primarily Twitter, but also real-time Google results and Facebook.”
In order to provide trails of trustworthiness in real-time streams, the authors plan to address the following challenges:
• “Establishment of new metrics that will help evaluate the trustworthiness of information people receive, especially from real-time sources, which may demand immediate attention and action. […] we show that coverage of a wider range of opinions, along with independence of results’ provenance, can enhance the quality of organic search results. We plan to extend this work in the area of real-time information so that it does not rely on post-processing procedures that evaluate quality, but on real-time algorithms that maintain a trail of trustworthiness for every piece of information the user receives.”
• “Monitor the evolving ways in which information reaches users, in particular citizens near election time.”
• “Establish a personalizable model that captures the parameters involved in the determination of trustworthiness of in- formation in real-time information channels, such as Twitter, extending the work of measuring quality in more static information channels, and by applying machine learning and data mining algorithms. To implement this task, we will design online algorithms that support the determination of quality via the maintenance of trails of trustworthiness that each piece of information carries with it, either explicitly or implicitly. Of particular importance, is that these algorithms should help maintain privacy for the user’s trusting network.”
• “Design algorithms that can detect attacks on [real-time information channels]. For example we can automatically detect bursts of activity re- lated to a subject, source, or non-independent sources. We have already made progress in this area. Recently, we advised and provided data to a group of researchers at Indiana University to help them implement “truthy”, a site that monitors bursty activity on Twitter. We plan to advance, fine-tune and automate this process. In particular, we will develop algorithms that calculate the trust in an information trail based on a score that is affected by the influence and trustworthiness of the informants.”
In conclusion, the authors “mention that in a month from this writing, Ushahidi […] plans to release SwiftRiver, a platform that ‘enables the filtering and verification of real-time data from channels like Twitter, SMS, Email and RSS feeds’. Several of the features of Swift River seem similar to what we propose, though a major difference appears to be that our design is personalization at the individual user level.”
Indeed, having been involved in SwiftRiver research since early 2009 and currently testing the private beta, there are important similarities and some differences. But one such difference is not personalization. Indeed, Swift allows full personalization at the individual user level.
Another is that we’re hoping to go beyond just text-based information with Swift, i.e., we hope to pull in pictures and video footage (in addition to Tweets, RSS feeds, email, SMS, etc) in order to cross-validate information across media, which we expect will make the falsification of crowdsourced information more challenging, as I argue here. In any case, I very much hope that the system being developed by the authors will be free and open source so that integration might be possible.
A copy of the paper is available here (PDF). I hope to meet the authors at the Berkman Center’s “Truth in Digital Media Symposium” and highly recommend the wiki they’ve put together with additional resources. I’ve added the majority of my research on verification of crowdsourced information to that wiki, such as my 20-page study on “Information Forensics: Five Case Studies on How to Verify Crowdsourced Information from Social Media.”