EPSRC Reference: |
EP/F02682X/1 |
Title: |
Pattern matching algorithms for massive datasets |
Principal Investigator: |
Clifford, Dr R |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Computer Science |
Organisation: |
University of Bristol |
Scheme: |
First Grant Scheme |
Starts: |
18 February 2008 |
Ends: |
17 August 2011 |
Value (£): |
278,545
|
EPSRC Research Topic Classifications: |
Fundamentals of Computing |
|
|
EPSRC Industrial Sector Classifications: |
No relevance to Underpinning Sectors |
|
|
Related Grants: |
|
Panel History: |
Panel Date | Panel Name | Outcome |
18 Oct 2007
|
ICT Prioritisation Panel (Technology)
|
Announced
|
|
Summary on Grant Application Form |
This project aims to provide the tools necessary for pattern matchingin massive datasets in the 21st century. Pattern matching problemsare pervasive and it is therefore hard to overstate theirimportance. The hugely successful new field of high throughputcomputational genetics, which is the lifeblood of pharmaceuticalindustries, is founded on the ability to perform approximate stringmatching accurately and quickly. Perhaps more mundanely but no lesssignificant economically, linear time exact matching algorithms arenow taken for granted as basic tools in every text editor and wordprocessor used today.Despite the success that pattern matching algorithms continue toenjoy, new problems in urgent need of a solution arise continually.These revolve around data processing applications where the datasetsare massive, subject to error or ambiguity and where processing isrequired online or in real-time. For example, the problem of exactmatching has well known optimal solutions both for online search andwhen the data to be queried can be indexed beforehand. However,unlike exact matching, the problem of finding the fastest algorithmsfor approximate matching has still not been resolved under almost anymeasure of similarity. Where the data is of a non-standard form, forexample consisting of numerical rather than symbolic information, evenless is currently known about how to search or index the informationefficiently.Another vital difference between the old and new settings is notsimply in the quantity of data available but also the ways in which ithas to be processed. The public genome sequencing projects, forexample, have produced 100s of gigabytes of sequence and related metadata. However these datasets are relatively straightforward to handlecompared to the processing of information passing through Internetrouters and over telephone wires every day or stored in the World WideWeb. In this situation it is not sufficient simply that an algorithmsruns fast. Ideally it should also require considerably less space thanthe input, update at least as quickly as the new data are arriving andstill be able to perform complex queries on the whole dataset.This project will directly address these two separate but interrelatedchallenges. Real-time and online matching algorithms will be developedto handle situations where vast amounts of data are streaming past atvery high rates. The project will also consider new forms ofapproximation and present fast algorithmic solutions that will allowdatasets that result from modern applications and industries to besearched for approximate matches without the need to rely onheuristics. Finally as part of the work on improved methods forapproximate matching, this project will develop faster and smallerindexes that will for the first time allow approximate matching ontruly massive datasets to become feasible in practice.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.bris.ac.uk |