EPSRC Reference: |
EP/R013381/1 |
Title: |
Statistical and Computational Challenges in High-dimensional Data Analysis |
Principal Investigator: |
Shah, Dr RD |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Pure Maths and Mathematical Statistics |
Organisation: |
University of Cambridge |
Scheme: |
First Grant - Revised 2009 |
Starts: |
01 June 2018 |
Ends: |
31 March 2021 |
Value (£): |
100,220
|
EPSRC Research Topic Classifications: |
Statistics & Appl. Probability |
|
|
EPSRC Industrial Sector Classifications: |
|
Related Grants: |
|
Panel History: |
|
Summary on Grant Application Form |
We are living in an age of information: scientists, businesses and governments are collecting datasets of unprecedented size and complexity at an ever-increasing rate, with the hope of using statistics to discover patterns and help inform decisions that will shape the future of our society. Typically, datasets consist of observations (e.g. patients) on which a number of variables have been measured (e.g. height, weight). Whilst modern datasets can have many observations, the trend today is towards datasets with a very large number of variables. This is particularly true in genomics, where scientific advances have allowed researchers to collect detailed genetic information on patients amounting to thousands or even hundreds of thousands of variables. More generally, automated data collection has given rise to so-called high-dimensional datasets across a variety of disciplines. For example, in healthcare analytics, aspects of a patient's history can give rise to datasets with a huge number of variables indicating what combinations of drugs were prescribed at particular times.
The field of high-dimensional statistics is a response to the challenges posed by these sorts of datasets which often render infeasible more traditional approaches designed for settings with only a handful of carefully chosen variables. Whilst much progress has been made, there remain several challenges, and this proposal will address some key outstanding methodological problems. Our methods will be applicable in a wide variety of settings, but two areas of application we will explore in collaboration are genomics and healthcare analytics. Our proposal consists of three projects which are described below.
Often, along with the variables measured on a number of observations, we have an outcome or response of interest whose relationship with the variables we wish to learn from the data. In many cases, this relationship can be complex and depend on interactions between several groups of variables. Searching for combinations of variables which only together contribute to the response presents a serious computational challenge, as the number of subsets of variables to search through grows quickly with the size of the subset. Even examining interacting pairs of variables can be computationally infeasible when the number of variables is in the tens of thousands. A key contribution of our research will be to develop new methods that can scale efficiently to capture high-order interactions in high-dimensional data.
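To give a rough sense of the scale involved, the short Python sketch below counts how many candidate subsets an exhaustive search would have to examine for a given interaction order. It is an illustration only and not one of the methods proposed here; the function name `n_interactions` and the figure of 10,000 variables are assumptions chosen for the example.

```python
from math import comb


def n_interactions(p: int, order: int) -> int:
    """Number of distinct subsets of `order` variables chosen from `p` variables."""
    return comb(p, order)


if __name__ == "__main__":
    p = 10_000  # hypothetical number of variables, chosen for illustration
    for order in (2, 3, 4):
        # An exhaustive search would need to assess every one of these subsets.
        print(f"order {order}: {n_interactions(p, order):,} candidate subsets")
```

With 10,000 variables there are already 49,995,000 pairs and 166,616,670,000 triples, which is why a search that assesses every subset quickly becomes impractical.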
Uncertainty quantification for high-dimensional data, for instance producing p-values quantifying the significance of variables in determining the response, is crucial in order to avoid deriving false conclusions from data. However, research on this important topic is still in its infancy, and many existing approaches are highly unstable in practical settings. Our proposal will develop new robust and computationally efficient methods for p-value construction and other forms of uncertainty quantification for a variety of models.
In some settings we do not have a distinguished response but rather would like to understand relationships between the variables themselves. Graphical models provide a useful way to model such dependencies, but the available methods are often not scalable to the size of datasets now faced by many practitioners. We will use new computational techniques to develop randomised algorithms that avoid explicitly assessing each pair of variables to determine their relationship but can still deliver estimates of the strongest dependencies. The method will have broad applicability; with biological data, for example, it can help to learn the network of dependencies governing the underlying biological processes.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.cam.ac.uk |