EPSRC Reference: |
EP/C010035/1 |
Title: |
Extracting the Science from Scientific Publications |
Principal Investigator: |
Copestake, Professor A |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Computer Science and Technology |
Organisation: |
University of Cambridge |
Scheme: |
Standard Research (Pre-FEC) |
Starts: |
01 November 2005 |
Ends: |
30 April 2010 |
Value (£): |
676,835
|
EPSRC Research Topic Classifications: |
Artificial Intelligence |
Comput./Corpus Linguistics |
Intelligent & Expert Systems |
|
|
EPSRC Industrial Sector Classifications: |
|
Related Grants: |
|
Panel History: |
|
Summary on Grant Application Form |
Many tools exist for processing natural languages, such as English, but there is no single perfect system. Different approaches have different strengths and weaknesses. For instance, some very fast processors are designed to make decisions about part of speech: e.g., that 'fly' in the sentence 'You'll have to fly' is a verb rather than a noun. Other processors can do much more: e.g., they realise that 'you' will be doing the flying, and may be able to decide whether 'fly' is meant literally or idiomatically (in context). But such 'deep' systems are much slower at processing text and far more complex to build than the simpler 'shallow' systems. Therefore researchers in natural language processing try to combine multiple systems in different ways, in particular so that deep systems are only used on text that is identified as interesting by a shallow system. However, progress has been hindered by the lack of a common interface between systems.
We are developing a formal language which captures some aspects of the meaning of natural language in a way that allows contributions from different processors to be combined. The combined systems can be used to extract knowledge from text for later machine use, or to give human browsers information about the structure of texts and their interconnections. In this project, we will use this approach to analyse research papers in Chemistry, so that aspects of their meaning can be extracted and used in the Semantic Web. For example, we can obtain information about how particular compounds are synthesised and represent this so that researchers can look up the information more easily. We are also trying to automatically discover information about the meaning of terms used in Chemistry. For instance, our system might discover that 'an alkaloid is a type of azacycle' from the phrase 'the concise synthesis of naturally occurring alkaloids and other complex polycyclic azacycles'.
We will also analyse text structure so that we can tell whether an author is agreeing with a previous publication or criticising it.
These tools will be combined in a complete system for use by working chemists who will give us feedback on the results. We are collaborating with major publishers who are allowing us to experiment with papers in their collections. We expect to use a GRID of parallel computers to process tens of thousands of papers in order to build a substantial knowledge base. At the end of the project, we will investigate the extension of this work to other sciences. However, the general approach will have wide application to extraction of information from many types of text.
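As a rough illustration of the term-discovery example in the summary ('an alkaloid is a type of azacycle'), the sketch below applies a single 'X and other Y' lexical pattern with a regular expression. It is not the project's method, which the summary describes as combining deep and shallow processors. The function name `find_hyponyms`, the pattern itself, and the crude plural handling are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern: "<plural noun> and other <optional modifiers> <plural noun>".
# Assumption: a word ending in 's' is treated as a plural noun, which is only a
# rough heuristic; the first such word after "other" is taken as the broader term.
PATTERN = re.compile(r"\b(\w+s) and other ((?:\w+ )*?)(\w+s)\b")


def find_hyponyms(text):
    """Return (narrower_term, broader_term) pairs suggested by 'X and other Y'."""
    pairs = []
    for match in PATTERN.finditer(text):
        narrower = match.group(1)[:-1]  # crude singularisation: drop the final 's'
        broader = match.group(3)[:-1]
        pairs.append((narrower, broader))
    return pairs


phrase = ("the concise synthesis of naturally occurring alkaloids "
          "and other complex polycyclic azacycles")
print(find_hyponyms(phrase))
```

A surface pattern like this is fast but brittle, which is the kind of trade-off between 'shallow' and 'deep' processing that the summary describes.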
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.cam.ac.uk |