Details of Grant 

EPSRC Reference: EP/H050442/1
Title: Word segmentation from noisy data with minimal supervision
Principal Investigator: Goldwater, Professor S
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Department: Sch of Informatics
Organisation: University of Edinburgh
Scheme: Standard Research
Starts: 24 January 2011
Ends: 23 April 2014
Value (£): 281,783
EPSRC Research Topic Classifications:
Artificial Intelligence
Comput./Corpus Linguistics
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel Date | Panel Name | Outcome
13 Jul 2010 | ICT Prioritisation Panel (July 2010) | Announced
Summary on Grant Application Form
In recent years, the field of natural language processing (NLP) has made great advances in a wide range of areas, such as machine translation, document summarization, and topic identification. However, much of this success is due to systems that are built using large quantities of human-annotated data in a supervised machine learning approach. This means that languages with fewer annotated resources (low-density languages) are left without much useful language technology. An important direction in NLP research is therefore to improve our ability to develop successful systems using as little annotated data as possible. Research on completely unsupervised systems is particularly interesting not only for its potential to broaden the reach of NLP technology, but also because it may shed light on the ways in which human infants manage to learn language with little or no explicit instruction.
We propose to focus on the particular problem of word segmentation, and to develop a new type of probabilistic model, the infinite noisy channel model, for solving this problem in settings where little or no annotated data is available. Word segmentation refers to the problem of identifying word boundaries in either text or speech. It arises in NLP systems for many Asian languages, where words are not separated by whitespace, and also for infants learning language, because most spoken words are not separated by pauses. Previous work on unsupervised word segmentation has assumed that every time a particular word occurs, it is realized in exactly the same way. However, this is not the case for infants learning language (since words are subject to phonetic variability and noise in pronunciation), nor is it always true in NLP (if the input text contains errors, such as those produced by an optical character recognition system).
Our new model will address this shortcoming by simultaneously performing word segmentation and correction of noise and variability, to recover a sequence of de-noised words from the unsegmented noisy input. We plan to develop two different versions of our model. One of these will be designed to correct for phonetic variability, and will be evaluated as a cognitive model of human language acquisition. With this model, we hope to gain insight into the computational mechanisms that allow infants to successfully extract words from noisy input, and in particular to show that the Bayesian inference techniques used in our model are a plausible explanation of infants' learning behavior. The second version of our model will be designed to correct for errors resulting from optical character recognition, and will be evaluated as a word segmentation and error-correcting NLP application in several different languages. We hope to show that the model reduces the number of character errors in the document while also producing successful segmentations. We expect these improvements to be particularly pronounced in low-density language situations.
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.ed.ac.uk