
Understanding Unsuccessful Executions in Big-Data Systems

Type of publication Peer-reviewed
Form of publication Proceedings (peer-reviewed)
Author Rosà Andrea, Chen Lydia Y., Binder Walter
Project LoadOpt - Workload Characterization and Optimization for Multicore Systems

Proceedings (peer-reviewed)

Title of proceedings Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Place Shenzhen, China
DOI 10.1109/ccgrid.2015.138


Big-data applications are increasingly used in today’s large-scale datacenters for a wide variety of purposes, such as solving scientific problems, running enterprise services, and computing data-intensive tasks. Due to the growing scale of these systems and the complexity of the applications they run, jobs in big-data systems experience unsuccessful terminations of different kinds. While a large body of existing studies sheds light on failures occurring in large-scale datacenters, the current literature overlooks the characteristics and the performance impairment of a broader class of unsuccessful executions, which can arise from application failures, dependency violations, machine constraints, job kills, and task preemption. Nonetheless, deepening our understanding of this field is of paramount importance, as unsuccessful executions can lower user satisfaction, impair reliability, and waste substantial resources. In this paper, we describe the problem of unsuccessful executions in big-data systems and highlight the critical importance of improving our knowledge of this subject. We review the existing literature in this field, discuss its limitations, and present our own contributions to the problem, along with our research plan for the future.