Predicting and Mitigating Jobs Failures in Big Data Clusters

Type of publication Peer-reviewed
Publication form Proceedings (peer-reviewed)
Author Rosà Andrea, Chen Lydia Y., Binder Walter
Project LoadOpt - Workload Characterization and Optimization for Multicore Systems

Proceedings (peer-reviewed)

Title of proceedings Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Place Shenzhen, China
DOI 10.1109/ccgrid.2015.139


In large-scale datacenters, software and hardware failures are frequent, resulting in failures of job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, so far, prevailing research on datacenter workload characterization has overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big datacenters. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fanout. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an on-line predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The particular objective of postponing job terminations is to strike a good tradeoff between resource waste and false prediction of successful jobs. Our evaluation results show that the proposed method is able to significantly reduce the resource waste by 41.9% on average, and keep false terminations of jobs low, i.e., only 1%.
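The delay-based mitigation idea described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Job` fields, the single `resubmissions` feature, the logistic weights, and the rule that every terminated job wastes exactly the grace period are all hypothetical choices made for illustration. A job is terminated only if it is still running when the grace period expires and a logistic score predicts failure.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    job_id: str
    runtime: float              # time the job runs if it is never terminated
    fails: bool                 # true outcome, unknown to the policy
    features: Dict[str, float]  # hypothetical job attributes


def logistic_score(features: Dict[str, float],
                   weights: Dict[str, float],
                   bias: float) -> float:
    """Failure probability from a logistic model (weights are assumed)."""
    z = bias + sum(weights[name] * features.get(name, 0.0) for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def simulate(jobs: List[Job],
             predict_fail: Callable[[Dict[str, float]], bool],
             grace_period: float) -> Dict[str, float]:
    """Compare resource waste with and without delayed termination."""
    baseline_waste = 0.0      # time consumed by failed jobs, no policy
    policy_waste = 0.0        # time consumed by jobs that do not complete
    false_terminations = 0    # successful jobs terminated by mistake
    for job in jobs:
        if job.fails:
            baseline_waste += job.runtime
        # Terminate only jobs that outlive the grace period and are
        # predicted to fail.
        killed = job.runtime > grace_period and predict_fail(job.features)
        if killed:
            policy_waste += grace_period
            if not job.fails:
                false_terminations += 1
        elif job.fails:
            policy_waste += job.runtime
    return {
        "baseline_waste": baseline_waste,
        "policy_waste": policy_waste,
        "false_terminations": false_terminations,
    }


# Hypothetical model parameters and toy workload.
WEIGHTS = {"resubmissions": 1.0}
BIAS = -1.5


def predict_fail(features: Dict[str, float]) -> bool:
    return logistic_score(features, WEIGHTS, BIAS) >= 0.5


jobs = [
    Job("a", 100.0, True, {"resubmissions": 3}),
    Job("b", 50.0, False, {"resubmissions": 0}),
    Job("c", 5.0, True, {"resubmissions": 4}),
    Job("d", 80.0, False, {"resubmissions": 2}),
]
result = simulate(jobs, predict_fail, grace_period=10.0)
```

In this toy workload, a longer grace period lets short failing jobs such as `c` finish on their own and gives a predictor more time to observe a job, at the price of more time spent on jobs that are eventually terminated. That is the tradeoff between resource waste and false terminations that the abstract refers to.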