EPSRC Reference: |
EP/P031617/1 |
Title: |
Pin the Tail: Understanding Straggler Manifestation in Internet-based Distributed Systems |
Principal Investigator: |
Garraghan, Dr P |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Computing & Communications |
Organisation: |
Lancaster University |
Scheme: |
First Grant - Revised 2009 |
Starts: |
01 September 2017 |
Ends: |
30 November 2019 |
Value (£): |
96,599
|
EPSRC Research Topic Classifications: |
Fundamentals of Computing |
Networks & Distributed Systems |
|
EPSRC Industrial Sector Classifications: |
|
Related Grants: |
|
Panel History: |
Panel Date | Panel Name | Outcome |
19 Apr 2017 | EPSRC ICT Prioritisation Panel April 2017 | Announced |
|
Summary on Grant Application Form |
Distributed systems form the foundation of Internet infrastructure and are critical for fulfilling the technological and societal needs of the digital age. Comprising Cloud datacenters, compute clusters, and the Internet of Things, these systems are responsible for the effective provisioning and execution of a multitude of parallelizable applications. The increased complexity and scale of these systems have resulted in the manifestation of an emergent phenomenon that substantially degrades overall system performance and cannot be solved by simply increasing the number of compute nodes. This phenomenon is known as The Long Tail Problem, whereby task stragglers - a small subset of tasks that execute abnormally slowly - impede overall job completion time; it is systemic to all distributed systems that operate at sufficient scale. While work within this area attempts to address this problem through straggler detection or mitigation, its effectiveness depends on understanding the precise underlying causes of straggler manifestation and, importantly, determining what system conditions influence their occurrence. However, achieving this understanding is incredibly challenging given the multitude of possible straggler root causes, all of which can stem from diverse sub-system operational characteristics and their interactions with other sub-systems. As current understanding of straggler manifestation is restricted to qualitative, high-level detail, it is presently impossible to determine what system operational conditions (e.g. cluster resource contention, temperature, failures) are highly likely to create a "perfect storm" for straggler occurrence. Determining the system conditions that influence the probability of straggler occurrence in different operational scenarios is vital to achieving predictable and rapid parallel application execution, given the continued increase in system size and complexity.
The vision of this proposed research is to address our limited understanding of straggler manifestation by conducting in-depth analysis and modelling of Internet-based distributed systems to quantify the precise relationship between straggler occurrence and system behaviour. This study will involve analysis and modelling of stragglers within real systems, performed through comprehensive experimentation to identify and extract key system parameters from virtual and physical sub-system operation across the entire distributed system architecture. A framework capable of automated analysis will be constructed to determine straggler root causes within production systems; it will interface with an event-based simulation engine for determining the optimal system conditions for avoiding stragglers.
By working with leading international industrialists in massive-scale distributed systems, this work represents a significant step change towards solving The Long Tail Problem by providing much sought-after knowledge to truly understand straggler manifestation. As this problem is systemic across every type of large-scale distributed system, this work will have far-reaching implications for both academia and industry, and will provide direct benefit to the competitiveness of the UK's digital economy in both the short and long term. This grant represents the first step towards realizing the research ambition of scientifically understanding the operation of massive-scale Internet infrastructure, enabling the design of fault-tolerant techniques for future systems at unprecedented scale - a crucial objective towards realizing key emergent technologies for the future.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
http://www.lancs.ac.uk |