Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures

Type of publication Peer-reviewed
Publication form Proceedings (peer-reviewed)
Author Rosà Andrea, Chen Lydia Y., Binder Walter
Project LoadOpt - Workload Characterization and Optimization for Multicore Systems

Proceedings (peer-reviewed)

Title of proceedings 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Place Rio de Janeiro, Brazil
DOI 10.1109/dsn.2015.37


Motivated by the high system complexity of today’s datacenters, a large body of related studies tries to understand workloads and resource utilization in datacenters. However, there is little work on exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, impacts on application performance, and root causes. We first quantitatively show their strong negative impact on CPU, RAM, and DISK usage and on task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.