Project


Online Data Center Modeling

Applicant Soulé Robert
Number 159537
Funding scheme Project funding (Div. I-III)
Research institution Istituto di sistemi informatici (SYS) Facoltà di scienze informatiche
Institution of higher education Università della Svizzera italiana - USI
Main discipline Information Technology
Start/End 01.12.2015 - 30.11.2018
Approved amount 347'014.00

Keywords (3)

Data modeling; Networking; Data center

Lay Summary (translated from Italian)

Lead
Today's data centers are critical infrastructure, but they are also complex, dynamic, and highly interconnected systems whose performance, behavior, and failure modes are difficult to understand and predict. The core problem is that no general model exists for representing the state of a data center that adequately expresses the network configuration and the load at every layer. The objective of this project is therefore to formulate a comprehensive model of the operational state of a modern data center. Such a model will provide a common platform for diverse management functionality and will play a foundational role analogous to that of the relational model for tabular data in databases. It can also act as a catalyst for the management-software industry and serve as a basis for interoperability and for a range of analysis and control procedures and techniques for computer networks.
Lay summary

Today's data centers are critical infrastructure, but they are also complex, dynamic, and highly interconnected systems whose performance, behavior, and failure modes are difficult to understand and predict. These difficulties stem from the fact that data centers operate, and require analysis, at many layers (physical network connectivity, physical and virtual servers, virtual networks, routing, service-oriented architectures, application runtime environments, etc.), while management tools and techniques work at only one or a few of these layers.

With this project we aim to build a common data model and a corresponding representation of the operational state of a data center that can be initialized and updated from measurement traces and configuration data. Operators can then use this model to determine global properties (for example, traffic matrices) and to run workload simulations to evaluate configuration changes. No such model and representation of a modern data center exists today.

This project will have an impact well beyond building tools for data center operators. The goal is to provide a common platform for diverse management functionality, with a foundational role analogous to that of the relational model for tabular data in databases. Our model can act as a catalyst for the management-software industry and serve as a basis for interoperability, for performance comparisons, and for the formal verification of SDN controllers. In the long term, we hope to deepen our understanding of the design principles of network controllers and distributed systems.

 

Last update: 02.04.2015


Abstract

This project will create a common data model and representation for the state of an operational data center, driven by real-world use cases and deployments, which can serve as a solid foundation for cross-layer management of entire distributed hardware and software stacks.

Modern data centers are the crucial infrastructure for storing, processing, and distributing information for applications that touch all walks of life, including health care, finance, communications, and other industries. However, data centers are also complex, dynamic, highly networked systems, and, as such, their capacity, performance, behavior, and failure modes are difficult to predict, understand, and plan for. A recent survey of federal agencies in the United States reports that 94 percent of federal data centers experience downtime as a result of data center complexity. Reports of data center outages have become commonplace. Notably, outages involving Amazon's EC2 cloud computing facilities in turn took down many important websites, of which Reddit, GitHub, Foursquare, and Airbnb were only the most high-profile. A recent outage at Visa left customers unable to use their credit cards.

Less visibly, our commercial collaborators in this project confirm a truism in the industry: ensuring that the applications in a data center meet their performance and availability targets in the face of changes in offered load, reconfigurations, upgrades, and equipment and software failures is an extremely difficult problem that costs companies dearly in expense, energy, and employee time. A major reason for this complexity is that the many conceptual layers involved in an enterprise data center (physical network connectivity, physical and virtual machines, link layers, VLANs, routing, service-oriented architectures, application deployment, etc.) are managed today by tools and techniques that focus on only one or a few layers.
Worse, these layers are typically operated by different divisions within the organization, creating a strong barrier to the development of commercial management solutions. The layers are also becoming more complex in themselves. A key example of this general challenge is the rise of Software-Defined Networking (SDN) techniques, which promise considerably more efficient and flexible networking at the link and IP layers through a centralized "controller" that dynamically creates forwarding table entries in switches. However, this power comes at the cost of increased complexity. SDN control software is itself a complicated distributed system that must maintain distributed state gathered from a variety of heterogeneous devices using asynchronous communication. Anecdotally, network administrators have been hesitant to deploy SDNs because the increased automation makes it harder for them to track down problems and to understand anomalous behavior in the system. As a result of this reluctance, many networks deploy SDN only partially until operators gain familiarity and confidence, leading to difficulties when SDN and legacy network software co-exist in the same network.

More importantly, today's SDN controllers (primarily based on the OpenFlow standard) operate at the level of IP flows rather than taking a global view of the data center state that can be reasoned about online. Moreover, they expose no internal representation that would allow the network layers they manage to be coupled to other layers of the stack (such as the several application levels, or the physical infrastructure). This project will take a radically different approach.
Rather than focusing on mechanisms to control and manage subsets of a data center, we will create a data model and representation of the state of a data center that can be populated and driven by logs, traces, and configuration information; queried by operators to determine global properties of the system (such as traffic matrices); and used to drive online workload-driven simulations that explore the effects of configuration changes. No such data model and schema for a modern data center exists at present.

Beyond these immediate applications, we argue that without such a foundational substrate, any attempt to manage an entire data center will result in a management system that is ad hoc, brittle in the presence of changes, and highly specific to a single context - in short, it will resemble the "home-grown" systems in use today.

We are in a highly advantageous position to carry out this work. Our collaborators in industry have data centers and networks that are highly instrumented, and they have agreed to share this instrumentation data with us. The work is also timely from a technological perspective: modern server technology can process the large volumes of trace data generated by a center (about 2 TB/day in the case of Amadeus, for example) in a compact space, and parallel data processing systems are appearing in the research community that can integrate graph processing, data streams, stored relational data, and continuous, incremental online queries within a single framework - exactly the workload that data center modelling presents. The primary goal of the proposed project is not to focus on the systems issues involved in building such a system. Instead, we plan to leverage existing research systems such as Naiad and ongoing work at ETH on parallel data processing and information management, as much as possible in the short term.
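As a concrete illustration of the kind of model and query described above, the sketch below shows a minimal, hypothetical relational-style representation: flow records ingested from traces populate a state store, and a traffic matrix is derived as a query over that state. All names here (FlowRecord, StateStore, traffic_matrix) are our own illustrative assumptions, not part of the project's actual model.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical minimal schema: a single relation of flow records,
# as might be ingested from switch or host traces.
@dataclass(frozen=True)
class FlowRecord:
    src_host: str
    dst_host: str
    bytes_sent: int

class StateStore:
    """Toy state representation, populated incrementally from trace data."""
    def __init__(self):
        self.flows = []

    def ingest(self, record: FlowRecord):
        # In a real system this would be driven by a continuous trace feed.
        self.flows.append(record)

    def traffic_matrix(self):
        """Derived global property: total bytes sent between each host pair."""
        matrix = defaultdict(int)
        for f in self.flows:
            matrix[(f.src_host, f.dst_host)] += f.bytes_sent
        return dict(matrix)

store = StateStore()
store.ingest(FlowRecord("web-1", "db-1", 1200))
store.ingest(FlowRecord("web-1", "db-1", 800))
store.ingest(FlowRecord("web-2", "db-1", 500))
print(store.traffic_matrix())
# {('web-1', 'db-1'): 2000, ('web-2', 'db-1'): 500}
```

The point of the sketch is the separation the proposal argues for: the store only ingests observations and answers queries; it takes no control actions, and global properties such as the traffic matrix fall out as queries over a shared representation rather than being computed by a layer-specific tool.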
As the project matures, we expect that the process of developing the data center model will provide insights into system design that will lay a foundation for future systems work. Neither will the project address the "actuation" part of data center management: it will concentrate solely on building and maintaining a representation rather than taking any action based on it. There are several reasons for this. Firstly, we feel strongly that without a clear logical foundation in representation, control policies will be too ad hoc, and it is essential to get this representation right first. Secondly, it is easier to deploy a prototype system operationally in a real data center (something we plan to do in this project) if it both delivers value to operators (by providing information views not otherwise available to them) and poses no threat to the infrastructure, merely ingesting trace data and providing a query interface over system state.

Instead, the work in this proposal will perform the vital task of creating, refining, deploying, and validating the abstract representation of the state of a data center. In doing so, we will build upon and extend recent work applying ideas from both knowledge representation and programming language semantics to understanding the operation of networks, including existing work by the PIs themselves.

If successful, we expect the results of this project to have a broad impact far beyond providing useful tools for data center operators. Our goal is to provide a shared substrate for diverse data center management functionality, analogous to the way that the relational model provided a common substrate for tabular data in databases. Such an extensible model can act as a disruptive incentive to the management software industry and serve as a basis for interoperability, comparative benchmarking, and verifiable SDN controllers.
In the longer term, we hope to greatly further our understanding of design principles for the control planes of networks and distributed systems.