EPSRC Reference: |
EP/K024043/1 |
Title: |
Statistical Natural Language Processing Methods for Computer Program Source Code |
Principal Investigator: |
Sutton, Dr C |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Sch of Informatics |
Organisation: |
University of Edinburgh |
Scheme: |
Standard Research |
Starts: |
01 October 2013 |
Ends: |
31 March 2017 |
Value (£): |
375,602
|
EPSRC Research Topic Classifications: |
Artificial Intelligence |
Comput./Corpus Linguistics |
Fundamentals of Computing |
Software Engineering |
|
EPSRC Industrial Sector Classifications: |
|
Related Grants: |
|
Panel History: |
Panel Date | Panel Name | Outcome |
21 Nov 2012 | EPSRC ICT Responsive Mode - Nov 2012 | Announced |
|
Summary on Grant Application Form |
Complex software systems involve many components and make use of many external libraries. Programmers who work on such software must remember the protocols for using all of those components correctly, and the process of learning to use a new component can be time-consuming and a source of bugs.
We believe that there is a major untapped resource that can help address this problem. Billions of lines of code are readily available on the Internet, much of which is of professional quality. Hidden within this code is a large amount of knowledge about good coding practices, for example, about avoiding error-prone constructs or about the best protocol for using a particular library. We envision a new type of programming tool, which could be called a data-driven development tool, that aggregates knowledge about programming from a large corpus of mature software projects and presents it within the development environment. Just as the current generation of IDEs helps developers to manage their code, the next generation of IDEs will help developers to learn how to write better code.
Fortunately, there is a research field that has already developed a large body of sophisticated tools for analyzing large amounts of text: namely, statistical natural language processing. The long-term strategic goal of this project is to develop new natural language processing techniques aimed at analyzing computer program source code, in order to help programmers learn coding techniques from the code of others. There is a large area for research here that has been almost completely unexplored.
As a first step in this research area, in this project we will focus on automatically identifying short code fragments, which we call idioms, that occur repeatedly across different software projects. An example of an idiom is the typical construct for iterating over an array in Java. Although they are ubiquitous in source code, idioms of this form have not to our knowledge been systematically studied, and we are unaware of any techniques for automatically identifying idioms. The main objective of this project is to develop new statistical NLP methods with the goal of automatically identifying idioms from a corpus of source code text. We call this research problem idiom mining, and it is to our knowledge a new research problem.
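To make the array-iteration example concrete, here is a minimal sketch of what such an idiom might look like in Java. The class and method names (`ArrayIterationIdiom`, `sumIndexed`, `sumEnhanced`) are hypothetical and chosen only for illustration; they do not come from the project. The point is that the loop skeleton recurs across unrelated projects while only the loop body changes.

```java
// Illustrative only: class and method names are hypothetical, not from the project.
public class ArrayIterationIdiom {

    // The indexed-loop idiom: start an index at 0, bound it by the array
    // length, and increment it by one on each pass.
    static int sumIndexed(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // The same traversal written with the enhanced for statement,
    // another recurring form of the idiom.
    static int sumEnhanced(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sample = {3, 1, 4};
        System.out.println(sumIndexed(sample));
        System.out.println(sumEnhanced(sample));
    }
}
```

Both methods share a fixed syntactic skeleton with a variable body, which is the kind of short, repeated fragment that idiom mining would aim to surface from a corpus.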
This is an interdisciplinary project that draws from statistical NLP, machine learning, and software engineering. The research work of this project is primarily in statistical NLP and machine learning, and will involve developing new statistical methods for finding idioms in programming language text.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.ed.ac.uk |