Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures

Publication type Peer-reviewed
Publication form Conference paper (peer-reviewed)
Authors Rosà Andrea, Chen Lydia Y., Binder Walter
Project LoadOpt - Workload Characterization and Optimization for Multicore Systems

Conference paper (peer-reviewed)

Proceedings title 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Location Rio de Janeiro, Brazil
DOI 10.1109/DSN.2015.37


Motivated by the high system complexity of today’s datacenters, a large body of related studies tries to understand their workloads and resource utilization. However, there is little work on exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, their impact on application performance, and their root causes. We first quantitatively show their strong negative impact on CPU, RAM, and disk usage and on task slowdown. We then analyze patterns of unsuccessful jobs and tasks, focusing in particular on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.