EPSRC Reference: |
EP/V002694/1 |
Title: |
New challenges in robust statistical learning |
Principal Investigator: |
Cannings, Dr T I |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Sch of Mathematics |
Organisation: |
University of Edinburgh |
Scheme: |
New Investigator Award |
Starts: |
01 May 2021 |
Ends: |
30 April 2024 |
Value (£): |
266,366
|
EPSRC Research Topic Classifications: |
Statistics & Appl. Probability |
|
|
EPSRC Industrial Sector Classifications: |
No relevance to Underpinning Sectors |
|
|
Related Grants: |
|
Panel History: |
|
Summary on Grant Application Form |
In recent years, our ability to collect, store and process vast amounts of data, coupled with rapid advances in technology, has led to the widespread adoption of data-driven decision-making. This includes new application areas, such as precision medicine, where doctors are using data to inform their diagnoses and treatment recommendations. In other areas, such as finance, banks use huge amounts of historical data in order to decide whether a new customer is likely (or not) to default on their loan repayments. It is often the case that we are required to make a discrete prediction about some future patient or customer, based on some (training) data relating to existing patients or customers. In statistics, problems of this type are called classification problems.
Many methods for classification are built on the assumption that any future data we may encounter has the same distribution as our training data. Of course, this assumption is not always valid -- data relating to one set of patients or customers will not necessarily follow the same distribution as data from a new set of people. In this research, we will develop new robust classification algorithms that can deal with noisy and incomplete data. In particular, we will introduce new methodology that enables practitioners to combine multiple sources of noisy data, propose modifications to existing methods in order to guarantee they are robust to corruptions in the data, and develop novel ways of overcoming the issues caused by missing data. We will also provide new theoretical understanding of the limitations of decision-making algorithms when faced with noisy, corrupted and incomplete data.
There are a number of scenarios where our new approaches will be applicable:
- We may have data collected from patients in a particular location (lab or hospital) but wish to make predictions in a different location.
- We may not have access to the full dataset. For example, for privacy reasons, users may not disclose some of their personal information. In other settings, we may be required to anonymise the data by removing some identifying covariates.
- Often the complexity of the data involved will mean that we don't observe the true data. Instead, we only have access to an approximation of the data. This typically occurs in modern settings, where practitioners use crowd-sourcing services such as Amazon Mechanical Turk to label their data -- such services are rarely perfectly accurate.
- It may be that an adversary is able to arbitrarily contaminate a small proportion of the data (for instance by performing artificial activity online).
Our work will enable practitioners to utilise data that is currently not appropriate for use. We will also provide new insight into the kinds of data that are most useful for a particular purpose.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.ed.ac.uk |