Lead
High performance computers are parallel systems with shared and distributed memory. The number of computing units in such systems has increased over the years and will continue to increase in the future. The result is computing systems with massive amounts of hardware parallelism. Hardware parallelism is complemented by software parallelism. A good match between the degrees and scales of these two types of parallelism, at the various levels of the high performance computing ecosystem, is key to exploiting the computational power these machines deliver.

Lay summary

Content and research objectives 

Hardware parallelism ranges from machine instructions to global compute sites. Similarly, software parallelism ranges from scalar instructions to global job queues. Exploiting the available hardware parallelism even at a single level is notoriously challenging, partly because of the difficulty of exposing and expressing parallelism in applications.

The project will answer the question: Given massive parallelism at multiple levels and of diverse forms and granularities, how can it be exposed, expressed, and exploited such that execution times are reduced, performance targets are achieved, and acceptable efficiency is maintained?

This project concentrates on scheduling and load balancing.
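
As a concrete illustration of load balancing at a single level of software parallelism, the sketch below dynamically self-schedules the iterations of an irregular loop across threads. This is a generic OpenMP example, not the project's specific method; the loop body and the chunk size of 4 are arbitrary illustrative choices.

    #include <math.h>
    #include <stdio.h>

    #define N 100000

    int main(void) {
        static double result[N];

        /* Iterations have non-uniform cost, so a static split across
           threads would leave some threads idle while others still work.
           schedule(dynamic, 4) hands out chunks of 4 iterations on
           demand, balancing the load at the thread level. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++) {
            int work = i % 1000;           /* irregular per-iteration cost */
            double x = 0.0;
            for (int j = 0; j < work; j++)
                x += sin((double)j);
            result[i] = x;
        }

        printf("result[N-1] = %f\n", result[N - 1]);
        return 0;
    }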

In this project, we propose a multilevel scheduling (MLS) approach for achieving scalable scheduling across the multiple levels of parallelism in large-scale high performance computing systems, with a focus on software parallelism.

The MLS approach will leverage all available parallelism and address hardware heterogeneity in large-scale high performance computers such that execution times are reduced, performance targets are achieved, and acceptable efficiency is maintained. The methodology for reaching these aims combines theoretical research, simulation, and experiments.
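
To make the idea of scheduling at more than one level tangible, the sketch below combines two levels in a hybrid MPI+OpenMP setting: a static block distribution of work across processes, and dynamic self-scheduling of iterations across threads within each process. This is a minimal illustrative sketch under assumed conventions (the block split, the chunk size of 64, and the stand-in workload are arbitrary choices), not the MLS design itself.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Level 1 (process level): block-distribute the iteration
           space across MPI ranks. */
        long chunk = N / size;
        long lo = rank * chunk;
        long hi = (rank == size - 1) ? N : lo + chunk;

        double local = 0.0;

        /* Level 2 (thread level): within each rank, self-schedule the
           iterations dynamically across OpenMP threads to absorb
           irregular per-iteration cost. */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += (i % 7) * 1e-9;   /* stand-in for irregular work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global = %f\n", global);

        MPI_Finalize();
        return 0;
    }

In this sketch the two levels are scheduled independently, with a fixed split at the top level; coordinating such decisions across levels, adaptively and at scale, is the kind of gap the MLS approach targets.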

Scientific and social context of the research project

This project builds on the most efficient existing scheduling solutions, extending them beyond one or two levels of parallelism and scaling them out within individual levels.

The project aims to make a fundamental advance toward easier-to-use large-scale high performance computing systems, with impact not only on the computer science community but also on all computational science domains.