EPSRC Reference: |
EP/J017213/1 |
Title: |
New challenges in high-dimensional statistical inference |
Principal Investigator: |
Samworth, Professor RJ |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Pure Maths and Mathematical Statistics |
Organisation: |
University of Cambridge |
Scheme: |
EPSRC Fellowship |
Starts: |
30 November 2012 |
Ends: |
29 November 2017 |
Value (£): |
1,190,194
|
EPSRC Research Topic Classifications: |
Statistics & Appl. Probability |
|
|
EPSRC Industrial Sector Classifications: |
|
Related Grants: |
|
Panel History: |
|
Summary on Grant Application Form |
As a society, more and more of the activities that we take for granted
rely on sophisticated technology, and are dependent on the fast and
efficient handling of large quantities of data. Obvious examples
include the use of internet search engines and mobile telephones.
Similarly, recent advances in healthcare are partly due to improved,
highly data-intensive scanning equipment in hospitals, and the
development of new, effective drug treatments, which have been the
result of extensive scientific study with data at its core.
Nevertheless, such advances can only be achieved through the
development of appropriate statistical models and methods which enable
practitioners to extract useful information from these vast quantities
of data. In order to capture the complexity of the data generating
processes, these models are inevitably high-dimensional, and have been
the topic of an enormous amount of research in Statistics over the
last 15 years or so.
This proposal addresses some of the fundamental and important
challenges in handling the huge data sets that routinely arise in the
applications above, as well as many others. For instance, in
high-dimensional models, sparse estimators are crucial for stability and
interpretability. But these give only a point estimate of a
parameter, and typically practitioners require more sophisticated
inferential statements to assess uncertainty. We will show how this
can be done by proposing easy-to-use and robust p-values
based on these sparse estimators.
One of the most important applications of sparse estimators is in
biotechnology. Indeed, we will apply our methodology described above
in a high-dimensional cancer study carried out by Danish
biostatisticians that uses microarray techniques. We will select,
with an associated quantification of uncertainty, a handful of stable
distinguishing genes for diffuse large B-cell lymphomas, thereby
enhancing our understanding of these cancers.
Another application area facing high-dimensional challenges is
neuroscience, and we will work on a study of dyscalculia that uses the
brain imaging technique of electroencephalography (EEG). Dyscalculia
is a mathematical disability that prevents normal arithmetic function.
Here, existing statistical techniques used by experimental
psychologists in this area are inadequate, and modern high-dimensional
methods have the potential to improve dramatically our understanding
of this disability.
In classification problems, the challenge is to assign an observation
to one of two or more classes based on its similarity to (labelled)
data from each of these classes. They are some of the most frequently
encountered high-dimensional statistical problems, particularly in
fields such as machine learning and areas of computer science such as
computer vision and robotics. We will provide a simple and robust
improvement to perhaps the most popular method (the k-nearest
neighbour classifier), by weighting the nearest neighbours in an
optimal fashion. We will also give a quantification of the
improvement. A related problem we will study is to quantify the
uncertainty of a classifier constructed from training data. As an
example, this could be used to give a doctor a measure of uncertainty
in a diagnosis.
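To illustrate the idea of weighting nearest neighbours, the following is a minimal sketch of a weighted nearest neighbour classifier. It is not the optimal weighting scheme referred to in the summary: the function name `weighted_knn_predict`, the toy data and the weight vectors are all assumptions chosen purely for illustration. The sketch only shows how giving closer neighbours larger votes can change a classification relative to the standard (equally weighted) k-nearest neighbour rule.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_x, train_y, query, weights):
    """Classify `query` by a weighted vote among its nearest training points.

    weights[i] is the vote given to the (i+1)-th nearest neighbour, so
    len(weights) plays the role of k. Equal weights recover the standard
    k-nearest neighbour classifier. (Hypothetical helper for illustration.)
    """
    k = len(weights)
    if k == 0 or k > len(train_x):
        raise ValueError("need 1 <= len(weights) <= number of training points")
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    # Accumulate the weighted votes for each class label.
    votes = defaultdict(float)
    for rank, idx in enumerate(order[:k]):
        votes[train_y[idx]] += weights[rank]
    return max(votes, key=votes.get)


# Toy data (assumed): one "a" point very close to the query, two "b" points
# somewhat further away, and one distant "a" point.
train_x = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 5.0)]
train_y = ["a", "b", "b", "a"]
query = (0.1, 0.0)

uniform = [1 / 3, 1 / 3, 1 / 3]   # standard 3-nearest neighbour vote
decreasing = [0.6, 0.25, 0.15]    # illustrative weights, not optimal ones

label_uniform = weighted_knn_predict(train_x, train_y, query, uniform)
label_weighted = weighted_knn_predict(train_x, train_y, query, decreasing)
```

With these toy inputs the equally weighted vote is decided by the two more distant "b" points, whereas the decreasing weights let the single very close "a" point dominate. How the weights should be chosen to be optimal is the question the proposal addresses.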
The final main issue we will address concerns model misspecification.
This is a particularly important issue in high-dimensional statistical
problems, where it is almost inevitable that our model misses some
important effects, or does not model them in the correct way. We will
provide understanding of how statistical procedures perform in such
circumstances and develop new ones that are robust to model
misspecification. A particularly important application will be to
Independent Component Analysis models that are very popular in
statistical signal processing for analysing data arising from multiple
sources, including microarray and brain imaging data.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.cam.ac.uk |