Details of Grant

EPSRC Reference:

EP/J017213/1

Title:

New challenges in high-dimensional statistical inference

Principal Investigator:

Samworth, Professor RJ

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Department:

Pure Maths and Mathematical Statistics

Organisation:

University of Cambridge

Scheme:

EPSRC Fellowship

Starts:

30 November 2012

Ends:

29 November 2017

Value (£):

1,190,194

EPSRC Research Topic Classifications:

Statistics & Appl. Probability

EPSRC Industrial Sector Classifications:

Healthcare

Related Grants:

Panel History:

Panel Date	Panel Name	Outcome
13 Mar 2012	EPSRC Mathematics Fellowships - March 2012	Announced
30 Jan 2012	Mathematics Prioritisation Panel Meeting January 2012	Deferred

Summary on Grant Application Form

As a society, more and more of the activities that we take for granted
rely on sophisticated technology, and are dependent on the fast and
efficient handling of large quantities of data. Obvious examples
include the use of internet search engines and mobile telephones.
Similarly, recent advances in healthcare are partly due to improved,
highly data-intensive scanning equipment in hospitals, and the
development of new, effective drug treatments, which have been the
result of extensive scientific study with data at its core.

Nevertheless, such advances can only be achieved through the
development of appropriate statistical models and methods which enable
practitioners to extract useful information from these vast quantities
of data. In order to capture the complexity of the data generating
processes, these models are inevitably high-dimensional, and have been
the topic of an enormous amount of research in Statistics over the
last 15 years or so.

This proposal addresses some of the fundamental and important
challenges in handling the huge data sets that routinely arise in the
applications above, as well as many others. For instance, in
high-dimensional models, sparse estimators are crucial for stability and
interpretability. But these give only a point estimate of a
parameter, and typically practitioners require more sophisticated
inferential statements to assess uncertainty. We will show how this
can be done by proposing easy-to-use and robust p-values
based on these sparse estimators.

One of the most important applications of sparse estimators is in
biotechnology. Indeed, we will apply our methodology described above
in a high-dimensional cancer study carried out by Danish
biostatisticians that uses microarray techniques. We will select,
with an associated quantification of uncertainty, a handful of stable
distinguishing genes for diffuse large B-cell lymphomas, thereby
enhancing our understanding of these cancers.

Another application area facing high-dimensional challenges is
neuroscience, and we will work on a study of dyscalculia that uses the
brain imaging technique of electroencephalography (EEG). Dyscalculia
is a mathematical disability that prevents normal arithmetic function.
Here, existing statistical techniques used by experimental
psychologists in this area are inadequate, and modern high-dimensional
methods have the potential to dramatically improve our understanding
of this disability.

In classification problems, the challenge is to assign an observation
to one of two or more classes based on its similarity to (labelled)
data from each of these classes. Such problems are some of the most
frequently encountered high-dimensional statistical problems, particularly in
fields such as machine learning and areas of computer science such as
computer vision and robotics. We will provide a simple and robust
improvement to perhaps the most popular method (the k-nearest
neighbour classifier), by weighting the nearest neighbours in an
optimal fashion. We will also give a quantification of the
improvement. A related problem we will study is to quantify the
uncertainty of a classifier constructed from training data. As an
example, this could be used to give a doctor a measure of uncertainty
in a diagnosis.
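
To illustrate the idea of weighting nearest neighbours, the following is a minimal sketch only. It assumes Euclidean distance and a hypothetical linearly decaying weight scheme (`linear_decay_weights`); the optimal weights proposed in this project are not stated in this summary and are not reproduced here. The function names and the toy data are likewise illustrative assumptions.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_x, train_y, query, weights):
    """Classify `query` by a weighted vote among its nearest neighbours.

    weights[i] is the vote given to the (i+1)-th nearest neighbour, so
    len(weights) plays the role of k. Equal weights recover the
    standard (unweighted) k-nearest neighbour classifier.
    """
    k = len(weights)
    if k > len(train_x):
        raise ValueError("more weights than training points")
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = defaultdict(float)
    for rank, idx in enumerate(order[:k]):
        votes[train_y[idx]] += weights[rank]
    # Predict the label with the largest total weight.
    return max(votes, key=votes.get)


def linear_decay_weights(k):
    """Hypothetical scheme: weights fall linearly with rank and sum to 1."""
    raw = [k - i for i in range(k)]
    total = sum(raw)
    return [w / total for w in raw]


# Toy data (assumed for illustration): two points of class "a" near the
# query, three points of class "b" further away.
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.0)]
train_y = ["a", "a", "b", "b", "b"]
query = (0.3, 0.3)

print(weighted_knn_predict(train_x, train_y, query, [1 / 5] * 5))
print(weighted_knn_predict(train_x, train_y, query, linear_decay_weights(5)))
```

With equal weights the three more distant "b" points outvote the two close "a" points, whereas the decaying weights let the closest neighbours dominate. How best to choose such weights is the question the paragraph above refers to.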

The final main issue we will address concerns model misspecification.
This is a particularly important issue in high-dimensional statistical
problems, where it is almost inevitable that our model misses some
important effects, or does not model them in the correct way. We will
provide understanding of how statistical procedures perform in such
circumstances and develop new ones that are robust to model
misspecification. A particularly important application will be to
Independent Component Analysis models, which are very popular in
statistical signal processing for analysing data arising from multiple
sources, including microarray and brain imaging data.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.cam.ac.uk