The field of Natural Language Processing (NLP) has made unprecedented progress over the last decade, fuelled by the introduction of increasingly powerful neural network models. These models have an impressive ability to discover patterns in training examples, and to transfer these patterns to previously unseen test cases. Despite their strong performance in many NLP tasks, however, the extent to which they "understand" language is still remarkably limited. The key underlying problem is that language understanding requires a vast amount of world knowledge, which current NLP systems largely lack. In this project, we focus on conceptual knowledge, and in particular on:
(i) capturing what properties are associated with a given concept (e.g. lions are dangerous, boats can float);
(ii) characterising how different concepts are related (e.g. brooms are used for cleaning, bees produce honey).
Our proposed approach relies on the fact that Wikipedia contains a wealth of such knowledge. A key problem, however, is that important properties and relationships are often not explicitly mentioned in text, especially if, for a human reader, they follow straightforwardly from other information (e.g. if X is an animal that can fly, then X probably has wings). Beyond learning to extract knowledge that is expressed in text, we thus also need to learn how to reason about conceptual knowledge.
A central question is how conceptual knowledge should be represented. Current NLP systems rely heavily on vector representations, in which each concept is represented by a single vector. It is now well understood how such representations can be learned, and they are straightforward to incorporate into neural network architectures. However, they also have important theoretical limitations in terms of the knowledge they can capture, and they only support shallow and heuristic forms of reasoning. In symbolic AI, by contrast, conceptual knowledge is typically represented using facts and rules. This enables powerful forms of reasoning, but symbolic representations are harder to learn and to use within neural networks. Moreover, symbolic representations cannot capture aspects of knowledge that are matters of degree (e.g. similarity and typicality), which is especially restrictive when modelling commonsense knowledge.
The solution we propose relies on a novel hybrid representation framework, which combines the main advantages of vector representations with those of symbolic methods. In particular, we will explicitly represent properties and relationships, as in symbolic frameworks, but encode them as vectors. Each concept will thus be associated with several property vectors, while each pair of related concepts will be associated with one or more relation vectors. These vectors will intuitively play the same role that facts play in symbolic frameworks, with associated neural network models playing the role of rules.
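As a concrete illustration of this hybrid organisation, the minimal Python sketch below shows how concepts with property vectors, concept pairs with relation vectors, and a neural "rule" could fit together. All names, the dimensionality, and the bilinear scoring function are illustrative assumptions, not the project's actual design.

```python
# Illustrative sketch only: the structures and the toy rule are assumptions,
# standing in for the learned representations and models described above.
from dataclasses import dataclass, field
import numpy as np

DIM = 64  # assumed dimensionality of property/relation vectors


@dataclass
class ConceptEntry:
    """A concept associated with several property vectors (vector-encoded 'facts')."""
    name: str
    property_vectors: list = field(default_factory=list)


@dataclass
class RelationEntry:
    """A pair of related concepts associated with one or more relation vectors."""
    head: str
    tail: str
    relation_vectors: list = field(default_factory=list)


# Example entries mirroring the examples in the text
lion = ConceptEntry("lion", [np.random.randn(DIM)])                 # e.g. encodes "dangerous"
broom_cleaning = RelationEntry("broom", "cleaning", [np.random.randn(DIM)])  # "used for"


def rule_score(premise: np.ndarray, conclusion: np.ndarray, W: np.ndarray) -> float:
    """A toy neural 'rule': scores how strongly one vector-encoded fact supports
    another, using a single bilinear layer in place of a learned model."""
    return float(premise @ W @ conclusion)


W = np.random.randn(DIM, DIM) * 0.01  # parameters of the toy rule
score = rule_score(lion.property_vectors[0], broom_cleaning.relation_vectors[0], W)
print(score)
```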
The main output of this project will be a comprehensive resource in which conceptual knowledge is encoded in this hybrid way. We expect this resource to play an important role in NLP, given the importance of conceptual knowledge for language understanding and its highly complementary nature to existing resources. To demonstrate its usefulness, we will focus on two challenging applications: reading comprehension and topic/trend modelling. We will also develop three case studies. In the first, we will learn representations of companies by using our resource to summarise their activities in a semantically meaningful way. In the second, we will use our resource to identify news stories that are relevant to a given theme. In the third, we will use our methods to learn semantically coherent descriptions of emerging trends in patents.