Consider the problem of measuring the amount of material in a heap of sand. In effect one has to calculate the total volume lying under a surface extending over a given region. If the surface is given by a formula then it is often feasible to calculate the volume using integral calculus. However, the calculations may be demanding. Moreover, there may be no formula; it may only be possible to determine the heights experimentally at specified locations. In that case numerical approximations can be made, but this may be tricky if the surface is very irregular. This issue becomes even more problematic for analogous problems in higher dimensions.
In practice it is often preferable to employ a statistical method: one measures the heights at randomly chosen points, drawn uniformly from the region, and then averages the resulting measurements. Multiplied by the area of the region, this gives a statistical estimate of the volume, the accuracy of which increases as one samples more randomly chosen points. This method works well if the region in question is relatively simple, and is colourfully known as the Monte Carlo method.
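To make the idea concrete, here is a minimal sketch of the plain Monte Carlo estimate in Python; the quadratic height function and the square region are purely illustrative stand-ins for the heap of sand.

```python
import numpy as np

rng = np.random.default_rng(0)

def height(x, y):
    # Illustrative stand-in for the sand-heap surface: a rounded mound.
    return np.maximum(0.0, 1.0 - x**2 - y**2)

# Region: the square [-1, 1] x [-1, 1], whose area is 4.
n = 100_000
xs = rng.uniform(-1.0, 1.0, size=n)   # randomly chosen points, uniform over the region
ys = rng.uniform(-1.0, 1.0, size=n)

volume_estimate = 4.0 * height(xs, ys).mean()   # area times average height
print(volume_estimate)   # close to pi/2 (about 1.57) for this particular surface
```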
Even the task of drawing random points from the region can be challenging, and may be overwhelmingly difficult for important higher-dimensional analogues. So instead one designs a random sequence of points which, in the long run, are individually uniformly distributed. A simple example is to propose each point in turn as a random displacement from the previous point (using the normal distribution), rejecting the proposal if it falls outside the region (in which case the point simply stays where it is for that step). The first point is chosen arbitrarily, so the first few locations are biased, but this bias goes away in the long run. The average of the measurements, multiplied by the area of the region, again yields a statistical estimate of the volume whose accuracy improves as more points are used. There are many variations on this idea, often using more general random processes such as suitable Markov chains, but the above illustrates the important ideas.
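The following sketch illustrates this random-sequence idea under the same illustrative assumptions, now taking the region to be the unit disc; when a proposed displacement would leave the disc, the point stays put for that step.

```python
import numpy as np

rng = np.random.default_rng(1)

def height(x, y):
    # Same illustrative surface as before.
    return np.maximum(0.0, 1.0 - x**2 - y**2)

def in_region(p):
    # Region: the unit disc, whose area is pi.
    return p[0]**2 + p[1]**2 <= 1.0

scale = 0.5                       # standard deviation of the proposed displacement
n_steps = 200_000
current = np.array([0.9, 0.0])    # arbitrary first point

total = 0.0
for _ in range(n_steps):
    proposal = current + scale * rng.standard_normal(2)
    if in_region(proposal):       # otherwise reject: the point does not move
        current = proposal
    total += height(current[0], current[1])

print(np.pi * total / n_steps)    # estimate of the volume (about pi/2 here)
```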
This technique is known as Markov chain Monte Carlo (MCMC), and is now used in a wide variety of contexts, for example: statistical physics (the original context in which MCMC arose), randomised computer algorithms, study of bio-molecules, numerical integration, and (the focus of this project) applied statistics. In each case a different set of concerns leads to different emphases and different approaches.
In applied statistics one deals with probabilities of events rather than volumes of material, and with calculating long-run means of various functions relating to a probability distribution conditioned on observations. However, in essence the problem is as above. Issues include "burn-in" (how long must a long-run average be in order to avoid being biased by the choice of the first point) and "fast mixing" (how quickly does the sequence of sample points move around the sample space once equilibrium is attained). The project focuses on the question of fast mixing: how to choose the right kind of jump proposal to make mixing fast, and specifically how to choose the best scale for the distribution of proposed displacements.
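In the statistical setting the uniform region is replaced by a target probability distribution, typically known only up to a constant. The sketch below is a generic random-walk Metropolis sampler with a hypothetical standard normal target standing in for a posterior distribution given observations; the burn_in and scale parameters correspond to the burn-in and proposal-scale issues just described.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # Hypothetical target: log-density, up to an additive constant, of a
    # standard normal; in applications this would be a posterior given data.
    return -0.5 * np.sum(x**2)

def random_walk_metropolis(scale, n_steps, dim=10, burn_in=1_000):
    x = np.full(dim, 3.0)             # deliberately poor arbitrary start
    logp = log_target(x)
    kept, accepted = [], 0
    for step in range(n_steps):
        proposal = x + scale * rng.standard_normal(dim)
        logp_prop = log_target(proposal)
        # Metropolis rule: accept with probability min(1, target ratio).
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = proposal, logp_prop
            accepted += 1
        if step >= burn_in:           # discard the burn-in portion
            kept.append(x[0])
    return np.array(kept), accepted / n_steps

samples, rate = random_walk_metropolis(scale=0.7, n_steps=20_000)
print("acceptance rate:", rate)
print("long-run mean of first coordinate:", samples.mean())
```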
In 1997 Roberts et al. produced theory giving clear practical advice on tuning the scale: in high-dimensional situations only about a quarter of the proposals (asymptotically, a fraction 0.234) should be accepted. Much subsequent work has explored generalisations. However the method of proof, using stochastic differential equation (SDE) theory, has limitations. In a nutshell, it uses limits of random walks rapidly making small jumps, leading to severe smoothness restrictions. Recently it has been proposed instead to use "Dirichlet forms", which are based on averages of squares of jumps and are therefore not so restrictive. Additionally, the SDE approach works with a single aspect of the target probability distribution, while Dirichlet forms work with everything at once. The project will develop the Dirichlet form approach.
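The content of the tuning advice can be seen numerically. The sketch below, again using a hypothetical high-dimensional normal target, sweeps over proposal scales and records both the acceptance rate and the average squared jump per step, the latter being the kind of quantity on which the Dirichlet form approach is built; the largest average squared jumps typically occur when roughly a quarter of proposals are accepted.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 20

def log_target(x):
    # Hypothetical target: standard normal in 20 dimensions.
    return -0.5 * np.sum(x**2)

def acceptance_and_mean_squared_jump(scale, n_steps=20_000):
    x = rng.standard_normal(dim)
    logp = log_target(x)
    accepted, sq_jump_total = 0, 0.0
    for _ in range(n_steps):
        proposal = x + scale * rng.standard_normal(dim)
        logp_prop = log_target(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:
            sq_jump_total += np.sum((proposal - x) ** 2)
            x, logp = proposal, logp_prop
            accepted += 1
        # a rejected proposal contributes a squared jump of zero
    return accepted / n_steps, sq_jump_total / n_steps

for scale in (0.1, 0.3, 0.5, 0.7, 1.0):
    rate, msj = acceptance_and_mean_squared_jump(scale)
    print(f"scale {scale:.1f}: acceptance {rate:.2f}, mean squared jump {msj:.2f}")
```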