Dirty data costs US enterprises an estimated $600 billion annually. Errors in data lead to degraded service offerings and to losses of revenue, credibility and customers, and there is no reason to believe that the scale of the problem is any different in the UK or in any other society that depends on information technology. Enterprises are not the only victims of poor data: inaccuracies and discrepancies in scientific data also have severe consequences for the quality of scientific research. While errors and inconsistencies may be introduced for many reasons at various stages of data processing, data quality problems are particularly evident when overlapping or redundant information from multiple heterogeneous sources is integrated. With this comes the need for cleaning integrated data: detecting and removing errors, conflicts and inconsistencies in data integrated from multiple sources, in order to improve the quality of the integrated data. Despite its importance, no practical system is yet in place for effectively cleaning data integrated from multiple sources, in either traditional database or XML format. Data integration and cleaning, already highly challenging when considered separately, are significantly more difficult when tackled together, and a number of intriguing problems associated with them remain open or unexplored. Open issues include: how to capture the quality of the data (its consistency, accuracy, timeliness and completeness)? How to integrate data automatically? In the presence of inconsistencies, how to choose data from the most accurate and up-to-date sources? How to effectively detect and remove inconsistencies in the integrated data?

In response to this compelling need from both industry and scientific data management, this project will develop a principled basis and working tools for integrating and cleaning data. It will provide a new model, reasoning systems and complexity bounds for the analysis of data quality, as well as practical (approximation or heuristic) algorithms and techniques for conducting lossless integration, inconsistency detection and reconciliation, for data in traditional databases or in XML format. The novelty of the proposed research lies in the following.

1. Novel constraints to specify the consistency of data, data provenance analysis to determine and keep track of the accuracy and timeliness of the data, and lossless schema mappings to deal with the completeness of the data.
2. Automatic generation of lossless schema mappings in parallel with data provenance analysis.
3. Practical techniques for reasoning about constraints, including the analysis of constraint propagation from the sources to the integrated data, in order to discover constraints on the integrated data and eliminate redundancies.
4. Efficient detection of errors, conflicts and inconsistencies in integrated data, based on the automatic generation of detecting queries (a minimal illustrative sketch is given at the end of this summary).
5. Effective methods for removing errors and inconsistencies from integrated data, based on the accuracy and timeliness of the data.

The project will lead to the first uniform system that both integrates and cleans data. It will produce quality research results of considerable interest to the international database theory and systems communities, and beyond; the results will be published in top computer science journals and major international database conferences. The project involves extensive collaboration between the University of Edinburgh and Bell Labs, Lucent Technologies.
Bell Labs will provide a testbed for deploying and evaluating the system and tools to be developed, using real-world data from Lucent; the project will thus have an immediate impact on industry. The tools will also find immediate applications in scientific data management, e.g., in the Generation Scotland project, a partnership between academics and the National Health Service in Scotland.
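
To make the idea of detecting queries (item 4 above) concrete, the following is a minimal illustrative sketch, not part of the proposed system itself: given a simple functional dependency over a single relation, it generates a SQL query whose answers witness violations of the dependency. The table, column names and data are hypothetical, and the constraints the project will actually study are richer than plain functional dependencies.

```python
# Hypothetical sketch: generate a SQL query that detects violations of a
# functional dependency lhs -> rhs over a single table.
import sqlite3

def violation_query(table, lhs, rhs):
    """Build a self-join query returning rows that agree on `lhs`
    but disagree on `rhs`, i.e. witnesses of a violated dependency."""
    agree = " AND ".join(f"t1.{a} = t2.{a}" for a in lhs)
    differ = " OR ".join(f"t1.{b} <> t2.{b}" for b in rhs)
    return (f"SELECT DISTINCT t1.* FROM {table} t1 JOIN {table} t2 "
            f"ON {agree} WHERE {differ}")

# Toy integrated data; the assumed constraint is that zip determines city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, zip TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [("Ann", "EH8", "Edinburgh"),
                  ("Bob", "EH8", "Glasgow"),   # inconsistent with Ann's row
                  ("Eve", "G12", "Glasgow")])

q = violation_query("customer", ["zip"], ["city"])
for row in conn.execute(q):
    print(row)   # rows involved in a violation of zip -> city
```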