EPSRC Reference: |
EP/J020664/1 |
Title: |
CROSS: Real-time Story Detection Across Multiple Massive Streams |
Principal Investigator: |
Osborne, Dr M |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Sch of Informatics |
Organisation: |
University of Edinburgh |
Scheme: |
Standard Research |
Starts: |
01 May 2012 |
Ends: |
30 April 2013 |
Value (£): |
209,736 |
EPSRC Research Topic Classifications: |
Information & Knowledge Mgmt |
|
|
EPSRC Industrial Sector Classifications: |
Aerospace, Defence and Marine |
Information Technologies |
|
Related Grants: |
|
Panel History: |
Panel Date | Panel Name | Outcome |
09 Feb 2012 | Data Intensive Systems (DaISy) | Announced |
|
Summary on Grant Application Form |
The world is rapidly becoming more connected, with people communicating over multiple streams (social media, newswire, Wikipedia, etc.) on a bewildering range of topics and at a furious rate. Twitter alone receives more than 250 million new posts every day (Tsotsis 2011). This massive interconnection means that content can appear and quickly spread through and across different streams. For example, during the recent London riots, many tweets reported rioting events in real time as they happened. However, not all posted content is of good quality or factually correct, which complicates the job of monitoring such streams for any purpose. For example, a comedian spread false rumours on Twitter about Osama Bin Laden watching his television show (Lineham 2011). Communication streams are also known to spread rumours, outright misinformation and content with malicious intent. For instance, during the same riots, radicalising posts were spread calling for participation in the so-called "cyber-jihad" (BBC 2011). Systems that can identify such posts are of paramount importance for security monitoring purposes.
On the other hand, not all information spread on media such as Twitter is accurate or interesting. This is compounded by the peculiarities of messages on modern social media (short, jargon-laden, dependent on social context, etc.), where biased, incomplete, inaccurate and misleading messages are common. This makes it extremely challenging to automatically identify, in real time, events worth monitoring for security purposes.
We propose a distributed infrastructure to automatically identify important new events (aka stories) in real time by combining and comparing multiple message streams. The faster such story detection happens, the greater its value to many applications. A security agency using our system would be better prepared when dealing with fast-moving events as they unfold. Indeed, in this project, the notion of importance will be defined within a security context. Given that streams are typically biased and not everything in them can be trusted, a key requirement of the system is minimising false positives (uninteresting stories that are discovered). Moreover, the effective management and efficient processing of multiple streams of real-time data pose new technological and scientific challenges:
Challenge 1: Identify interesting new stories without drowning in a sea of false positives, while reducing the effects of bias and rumour.
Challenge 2: Minimise system latency, so that new stories are detected in real time.
We tackle the first challenge from the novel perspective of processing multiple streams and exploiting the fact that stories reported multiple times across several streams can cancel out stream-specific bias and errors. For example, if a story is true, then it is more likely to appear both on Twitter and as an update to a Wikipedia article. Alternatively, a story might appear on Twitter and also in a governmental cable. The more often a story occurs within and across streams, the more likely it is to be interesting. This is the cornerstone of our proposal, which we tackle by building upon modern first story detection techniques, adapted to account for bias and rumours.
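The cross-stream idea can be illustrated with a minimal sketch. It is not the project's implementation: the Jaccard word-overlap similarity, the 0.5 threshold, and all names (`CrossStreamDetector`, `add`, `corroboration`, `rank`) are assumptions made for illustration only. A message that is not similar enough to any known story starts a new story (a simple form of first story detection), and each story is then ranked by the number of distinct streams in which it has appeared.

```python
def tokens(text):
    """Lower-cased word set with surrounding punctuation stripped."""
    return {w.strip(".,!?:;\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CrossStreamDetector:
    """Toy first story detector that tracks which streams report each story.

    Hypothetical illustration: the similarity measure and threshold are
    assumptions, not the techniques used in the project.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.stories = []  # each story: {"tokens": set, "streams": {name: count}}

    def add(self, stream, text):
        """Assign a message to a story; return (story_id, is_new_story)."""
        toks = tokens(text)
        best, best_sim = None, 0.0
        for i, story in enumerate(self.stories):
            sim = jaccard(toks, story["tokens"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar story seen before: treat as a first story.
            self.stories.append({"tokens": toks, "streams": {stream: 1}})
            return len(self.stories) - 1, True
        counts = self.stories[best]["streams"]
        counts[stream] = counts.get(stream, 0) + 1
        return best, False

    def corroboration(self, story_id):
        """Number of distinct streams in which the story has appeared."""
        return len(self.stories[story_id]["streams"])

    def rank(self):
        """Story ids ordered from most to least corroborated across streams."""
        return sorted(range(len(self.stories)), key=lambda i: -self.corroboration(i))
```

In this sketch a story seen on both Twitter and Wikipedia ranks above one seen only on Twitter, regardless of how many times the single-stream story is repeated. A real system would need a far more scalable nearest-neighbour search than the linear scan used here.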
For the second challenge, we ensure low-latency story detection by using a distributed real-time data processing architecture (e.g. S4 or Storm), similar to MapReduce but better suited to real-time operation. Real-time architectures for dealing with massive-scale data are in their infancy, hence CROSS will present a first concrete application, together with the development of best practices for such architectures.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.ed.ac.uk |