Zurück zur Übersicht

Cosmmus 2: an infrastructure for scalable distributed applications

Titel Englisch Cosmmus 2: an infrastructure for scalable distributed applications
Gesuchsteller/in Pedone Fernando
Nummer 169066
Förderungsinstrument Projektförderung (Abt. I-III)
Forschungseinrichtung Facoltà di scienze informatiche Università della Svizzera italiana
Hochschule Università della Svizzera italiana – USI
Hauptdisziplin Informatik
Beginn/Ende 01.10.2016 - 31.05.2019
Bewilligter Betrag 150'982.00
Alle Daten anzeigen

Keywords (3)

high availability; scalability; distributed systems

Lay Summary (Deutsch)

Lay summary

Many current online services must meet strict availability and performance requirements. High availability implies tolerating server failures, which calls for redundancy of servers and data, that is, replication. Although replication is needed for fault tolerance, having two or more replicas capable of serving client requests introduces additional complexity. For example, how to avoid that two clients simultaneously book the same airline seat, after each one accesses a different replica? Obviously, replicas must coordinate, but simply having one replica wait for a message from the other won’t work since replicas may fail.

State machine replication is a solution to this problem. The idea is to arrange for replicas to receive and execute client requests in the same order. The execution of each request must be deterministic so that replicas produce the same results after the execution of the same requests. Therefore, clients know that a request has been executed after they receive the first response from a replica. In the example above, the replicas will execute requests in the same order and agree that only the client whose request is executed first should book the seat.

State machine replication is a well-established solution to increasing the availability of services and has been deployed in many production systems. When it comes to performance, however, it offers limited capability. Increasing the number of replicas will not translate into more requests execute per time unit since each replica must process all requests. This project investigates extensions to the state machine replication that will make it configurable with respect to both availability and performance.

Direktlink auf Lay Summary Letzte Aktualisierung: 01.01.0001

Lay Summary (Italienisch)

Fernando Pedone
Lay summary

Molti servizi online odierni sono tenuti a soddisfare requisiti rigorosissimi di disponibilità e di prestazioni. Un’infrastruttura a disponibilità elevata deve poter tollerare guasti ai server, per cui solitamente si ricorre alla ridondanza di server e di dati ("repliche”), ossia all’uso della cosiddetta “replicazione". Nonostante la replicazione sia necessaria per tollerare eventuali guasti, avere due o più repliche in grado di servire le richieste dei clienti comporta ulteriori complessità. Ad esempio, come evitare che due clienti riservino lo stesso posto aereo contemporaneamente, dopo che ognuno abbia effettuato l’accesso tramite una replica differente?

La replicazione di macchine a stati è una soluzione a tali problemi. L’idea è quella di predisporre le repliche per ricevere ed eseguire le richieste nello stesso ordine. L’esecuzione di ogni richiesta deve essere deterministica di modo che le repliche producano gli stessi risultati dopo l’esecuzione delle stesse richieste. Di conseguenza, i clienti sanno che una richiesta è stata eseguita dopo aver ricevuto la prima risposta da una replica. Nell’esempio sopraccitato, nonostante ogni cliente abbia contattato una replica differente, le loro richieste saranno eseguite dalle due repliche nello stesso ordine, e le repliche si accordano in maniera che solo il cliente la cui richiesta viene eseguita per prima possa riservare il posto.

La replicazione di macchine a stati è una soluzione ormai consolidata per aumentare la disponibilità dei servizi, ed è stata utilizzata in svariati sistemi di produzione. Per quanto riguarda le prestazioni, tuttavia, offre capacità limitate. L’aumento del numero di repliche non si traduce in un aumento di esecuzione di richieste per unità di tempo, dal momento che ogni replica deve elaborare tutte le richieste. Questo progetto indaga le estensioni della replicazione di macchine a stati che la rendano configurabile dal punto di vista sia della disponibilità che delle prestazioni.  


Direktlink auf Lay Summary Letzte Aktualisierung: 21.11.2016

Verantw. Gesuchsteller/in und weitere Gesuchstellende


Name Institut

Verbundene Projekte

Nummer Titel Start Förderungsinstrument
146404 Cosmmus: an infrastructure for massively multiplayer online games 01.04.2013 Projektförderung (Abt. I-III)


This proposal is an extension of a project on infrastructure for massively multiplayer online games, hereafter Cosmmus 1, SNF project number 146404, which will end on September 2016. Cosmmus 1 resulted in the development of the Scalable State Machine Replication approach, which was the main subject of the PhD thesis of Eduardo Bezerra. Cosmmus 1 also provided funding for Long Hoang Le to start his PhD.Cosmmus 1 was motivated by the design of infrastructure for massively multiplayer online games (MMOGs). In these games, many thousands of participants play simultaneously with one another in the same virtual environment. To support so many participants, a client-server infrastructure is generally used, where client requests are handled by a cluster of servers, and sometimes a datacenter. In this paradigm, the servers have the official copy of the game state and are responsible for computing state changes over time; clients are responsible for presenting the game state to the players. Cosmmus 1 looked into some of the key aspects of a distributed infrastructure to support MMOGs, namely consistency, performance, and fault tolerance.While the motivating application for Cosmmus 1 was MMOGs, the ultimate goal of the project was to design protocols that can be applicable in broader areas. In a larger sense, the project targeted applications whose state can be divided among several computers and require strong consistency, scalable performance, and configurable fault tolerance. The main result of the project is the design and implementation of Scalable State Machine Replication (S-SMR), an approach that builds on the well-established classical state machine replication (SMR). Similarly to SMR, S-SMR is a general-purpose replication technique. Any application that complies with SMR’s requirements (notably deterministic execution) can be deployed with S-SMR. Differently from SMR, S-SMR can scale performance with additional system resources. These resources are included in S-SMR in the form of additional state partitions.Achieving scalable performance in state machine replication is challenging. To cope with the complexity of the problem, S-SMR makes a number of simplifying assumptions. For example, one of these assumptions is that state variables are statically assigned to partitions. Determining a partitioning of the state that avoids load imbalances among partitions and favors efficient single-partition commands normally requires a good understanding about the workload. Even if enough information is available, finding a good partitioning is a complex optimization problem. Moreover, many online applications experience variations in demand. Ideally, S-SMR should be able to dynamically adapt the state partitioning to the workload. Dynamic partitioning introduces several challenging problems, which we describe in the proposal. In this context, the high-level goals of Cosmmus 2 are to continue the work started in Cosmmus 1 and extend the scope of Scalable State Machine Replication. In this document, we describe our contributions in Cosmmus 1 and detail the problems we intend to tackle in Cosmmus 2. This proposal requests funding for Long Hoang Le to complete his PhD thesis.