Details of Grant 

EPSRC Reference: EP/L018497/1
Title: Tractable statistical inference from genomic data using diffusion models
Principal Investigator: Jenkins, Dr P
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Lancaster University; Regents of the Univ California Berkeley; University of Oxford
Department: Statistics
Organisation: University of Warwick
Scheme: First Grant - Revised 2009
Starts: 01 April 2014
Ends: 31 March 2016
Value (£): 92,890
EPSRC Research Topic Classifications:
Statistics & Appl. Probability
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel Date | Panel Name | Outcome
27 Nov 2013 | Mathematics Prioritisation Panel Meeting Nov 2013 | Announced
Summary on Grant Application Form
If we were to obtain the sequence of my genome, we'd see a string of three billion letters from the DNA alphabet. Now let's sequence yours and compare the two. At most positions they'd be identical - we are both human - but at a small number, less than 1%, we'd see some variation. As the costs of DNA sequencing fall, obtaining your own genome will soon no longer be a hypothetical prospect. We are presently on the cusp of obtaining the genomes of thousands of people, providing a glimpse into the complex pattern of genetic variation across all humans. This data encodes a great deal of biological information such as the rate of mutations, and it also contains information about human demographic history, such as recent historical population size changes and migrations. Can we infer these things just from the genetic data?

Because the data are so rich and complex, this question has occupied statisticians, probabilists, and geneticists for many years. The key to this type of statistical inference is a suitable stochastic model: one important model is known as the Wright-Fisher diffusion. It describes the random fluctuations through time of the frequency of a variant in a large population - that is, it traces a trajectory for how prevalent the variant was at each point in time. Performing inference with diffusion models can be difficult. The purpose of this research is to contribute to making such inference tractable.
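
For readers who want the model in symbols, a common textbook form of the Wright-Fisher diffusion with two-way mutation is sketched below. The notation is an assumption made here for illustration and is not taken from the grant application: X_t is the frequency of the variant at time t, W_t is a standard Brownian motion, theta_1 and theta_2 are scaled mutation rates towards and away from the variant, and time is measured in units proportional to the population size.

```latex
dX_t = \tfrac{1}{2}\bigl[\theta_1 (1 - X_t) - \theta_2 X_t\bigr]\,dt
       + \sqrt{X_t (1 - X_t)}\,dW_t,
\qquad X_t \in [0, 1].
```

The square-root term vanishes at frequencies 0 and 1, so the random fluctuations are largest at intermediate frequencies and die out as the variant approaches loss or fixation.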

The approach here is to use a computationally-intensive, simulation-based, statistical technique: rather than work exhaustively, we simulate some random, representative samples from the model and average over them. A computer can provide us with a large number of samples, so that the error is expected to be small provided we wait long enough. So successful is this idea that it is used throughout science and engineering. Here, we must simulate paths from the Wright-Fisher diffusion - the random, unobserved trajectories of historical frequencies of genetic variants. Ensuring such simulation can be carried out efficiently on this and related diffusions is a first task of the research. Because of the generality of the models and the techniques involved, this has the potential to aid researchers in many fields outside genetics too.
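
The idea of simulating representative samples and averaging over them can be illustrated with a minimal Python sketch. It is not the project's method: it uses a simple Euler-Maruyama time-stepping approximation, clipped to stay inside [0, 1], which carries a discretisation error. The drift term (two-way mutation with hypothetical scaled rates `theta1` and `theta2`), the function names, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def simulate_wf_endpoints(x0, theta1, theta2, t_end, n_steps, n_paths, rng):
    """Approximate X at time t_end for n_paths independent Wright-Fisher paths.

    Assumed model: dX = 0.5*(theta1*(1-X) - theta2*X) dt + sqrt(X*(1-X)) dW.
    Uses Euler-Maruyama steps of size t_end / n_steps; this is an
    approximation, not an exact simulation of the diffusion.
    """
    dt = t_end / n_steps
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        drift = 0.5 * (theta1 * (1.0 - x) - theta2 * x)
        noise = np.sqrt(x * (1.0 - x) * dt) * rng.standard_normal(n_paths)
        # Clip so that frequencies remain valid after each approximate step.
        x = np.clip(x + drift * dt + noise, 0.0, 1.0)
    return x


def monte_carlo_mean(samples):
    """Return the sample mean and its estimated standard error."""
    estimate = samples.mean()
    std_error = samples.std(ddof=1) / np.sqrt(samples.size)
    return estimate, std_error


if __name__ == "__main__":
    rng = np.random.default_rng(2014)
    endpoints = simulate_wf_endpoints(
        x0=0.2, theta1=2.0, theta2=2.0, t_end=0.5,
        n_steps=200, n_paths=20_000, rng=rng,
    )
    estimate, std_error = monte_carlo_mean(endpoints)
    print(f"Estimated mean frequency at t=0.5: {estimate:.4f} (s.e. {std_error:.4f})")
```

The standard error shrinks in proportion to one over the square root of the number of simulated paths, which is the sense in which the error becomes small "provided we wait long enough". The time-stepping error is a separate source of inaccuracy that more samples do not remove.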

Given a method for sampling from the model, our next task is to embed it into an inference algorithm. However, this approach has been little applied to the framework of the Wright-Fisher diffusion, and there are open questions on the design of such an algorithm that this research will address, including some important specific issues. For example, we might simulate our diffusion path by many small, local increments, building up its trajectory in very small time steps based on what the data looks like at that time. We should hope that these trajectories will be consistent with the observed data overall, but ensuring such consistency is a global, not local, problem. The project will also address this issue.

Finally, we must specialize the algorithms for the analysis of genetic data. So that the work can be made accessible, convenient software will also be developed. Analysis of genetic data has the potential to provide a range of benefits: among other things, we can learn about human origins from ancient DNA, the evolution of pathogens, the progression of a tumour, the importance of natural selection, and the recent demographic history of humans. The last of these is a vital first step in predicting the nature of human genetic variation, which in turn is fundamental to our understanding of the genetic basis of the risk of many complex diseases.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.warwick.ac.uk