The market for conversational speech technologies is estimated to reach $3 billion by 2020 (Grand View Research, 2014). Our proposed technology will provide vital foundations and impetus for the rapid development of a next generation of naturally interactive conversational interfaces with deep language understanding, in areas as diverse as healthcare, human-robot interaction, wearables, home automation, education, games, and assistive technologies.
Future conversational speech interfaces should allow users to interact with machines using everyday spontaneous language to meet everyday needs. A commercial example with quite basic capabilities is Apple's Siri. However, even today's limited speech interfaces are very difficult and time-consuming to develop for new applications: their key components must currently be tailor-made by experts for specific application domains, relying either on hand-written rules or on statistical methods that require large amounts of expensive, domain-specific, human-annotated dialogue data. The components thus produced are of little or no use for any new application domain, resulting in expensive and time-consuming development cycles.
One key reason for this status quo is that general, scalable methods for natural language understanding (NLU), dialogue management (DM), and natural language generation (NLG) are not yet available for spoken dialogue. Current domain-general methods for language processing are sentence-based, and so perform fairly well on written text, but they quickly run into difficulties with spoken dialogue, because ordinary conversation is highly fragmentary and incremental: it proceeds word by word rather than sentence by sentence, through half-starts, suggested add-ons, pauses, interruptions, and corrections, without respecting sentence boundaries. It is precisely these properties that create the feeling of being engaged in a normal, natural conversation, and which current state-of-the-art speech interfaces fail to produce.
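As a rough illustration of what word-by-word processing means in practice, the following Python sketch (purely hypothetical, and not the grammar formalism the project will actually use) shows an understander that updates a partial interpretation after every word, rather than waiting for a complete sentence, which is what makes mid-utterance interruptions, corrections, and add-ons tractable:

```python
# Illustrative sketch only: an incremental understander consumes an utterance
# word by word, so a (partial) interpretation is available at every point.

class IncrementalUnderstander:
    def __init__(self):
        self.words = []                      # words seen so far

    def consume(self, word):
        """Extend the partial interpretation with one more word."""
        self.words.append(word)
        return self.partial_interpretation()

    def partial_interpretation(self):
        # Placeholder for real semantic construction: here the "interpretation"
        # is simply the word sequence seen so far.
        return " ".join(self.words)


nlu = IncrementalUnderstander()
for token in ["book", "a", "table", "for", "two", "no", "wait", "three"]:
    print(nlu.consume(token))                # usable output after every word
```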
We propose to solve these two problems together by, for the first time:
(1) combining domain-general, incremental, and scalable approaches to NLU, DM, and NLG;
(2) developing machine learning algorithms to automatically create working speech interfaces from data, using (1).
We propose a new method, "BABBLE", in which speech systems are trained to interact naturally with humans, much as a child experiments with new combinations of words to discover their usefulness (though doing this offline, so as not to annoy real users in the process!).
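To make the idea concrete, the sketch below (purely illustrative: the names, the bandit-style update, and the simulated user are assumptions for exposition, not the project's actual algorithm) shows offline "babbling": the system tries out combinations of actions against a simulated user and keeps an estimate of which combinations lead to successful dialogues, so neither annotated data nor live users are required.

```python
# Hypothetical sketch of offline "babbling" via exploration against a
# simulated user; details are illustrative, not the project's algorithm.
import random

def babble(actions, simulated_user, episodes=1000, epsilon=0.2):
    value = {a: 0.0 for a in actions}        # estimated usefulness of each action
    counts = {a: 0 for a in actions}
    for _ in range(episodes):
        # sometimes explore a new combination, otherwise exploit what works
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(value, key=value.get)
        reward = simulated_user(action)       # e.g. 1.0 if the dialogue succeeded
        counts[action] += 1
        value[action] += (reward - value[action]) / counts[action]
    return value

# Toy usage: the simulated user only "rewards" one particular system behaviour.
estimates = babble(["greet", "ask_slot", "confirm"],
                   lambda a: 1.0 if a == "confirm" else 0.0,
                   episodes=500)
```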
BABBLE will be deployed as a developer kit and as mobile speech apps for public use and engagement, and will also generate large dialogue data sets for scientific and industrial use.
This new method will not require expensive data annotation or expert developers, making it easy to create new speech interfaces that advance the state of the art by interacting more naturally, and therefore more successfully and engagingly, with users.
Recent advances have been made in key areas relevant to this proposal: incremental grammars, formal semantic models of dialogue, and sample-efficient machine learning methods. The opportunity to combine and develop these approaches has arisen only recently, and it now makes major advances in spoken dialogue technology possible.