Project

Developing Immunity against Failures in Large Concurrent Software Systems

English title: Developing Immunity against Failures in Large Concurrent Software Systems
Applicant: Candea George
Number: 120309
Funding scheme: Project funding
Research institution: Laboratoire des systèmes fiables EPFL - IC - IIF - DSLAB
Institution of higher education: EPF Lausanne - EPFL
Main discipline: Information Technology
Start/End: 01.03.2009 - 31.05.2012
Approved amount: 297'075.00

Lay Summary (English)

Software systems have many bugs. Software is getting larger and more complex, with many systems consisting of many millions of lines of code: Windows, Oracle, Linux, to name just a few. The rule of thumb in the software industry is that production code contains 5-10 bugs per KLOC (thousand lines of code), implying that these large systems have many thousands of bugs lurking within that programmers find too difficult or expensive to eradicate.

Large systems are often highly concurrent, and writing correct concurrent systems is one of the most challenging endeavors of software development. With the advent of cheap, parallel hardware such as multi-core CPUs, new software is written with more parallelism and existing systems achieve higher degrees of run-time concurrency, thus exercising more untested paths. Already some of the most insidious bugs are blamed on concurrency; we expect things to get worse.

Significant advances have been made in software development tools, but the rate at which they can reduce bugs/KLOC (i.e., bug density) is outpaced by the rate at which software size grows in KLOC (i.e., code volume). E.g., Linux code size more than doubled in the last 5 years and Windows quadrupled in less than 10 years. The net effect of these disparate rates of progress is more bugs overall.

In this project, we enabled software systems to learn from past failures and get "stronger" over time. We developed a set of runtime techniques that, with every encounter of a new failure, progressively improve the system's ability to avoid that failure in the future--this is what we call "developing immunity against failures." The specific topic of study is mechanisms that let programs automatically develop immunity against failures that can be avoided by taking alternate execution paths.

We built a system, called Dimmunix, that enables general-purpose applications to defend themselves against deadlock bugs, i.e., to avoid deadlocks that they have previously encountered. Dimmunix is implemented for Java, POSIX Threads, and Android. The POSIX Threads and Android versions currently provide immunity against deadlocks involving mutex locks. Android Dimmunix is implemented within the Dalvik VM, which runs all Android applications; it therefore provides platform-wide deadlock immunity to all applications running on an Android phone. We also optimized Java Dimmunix for synchronization-intensive applications and extended it with immunity against non-mutex deadlocks, i.e., deadlocks involving read-write locks, semaphores, condition variables, or external synchronization. We ran Dimmunix with real applications such as JBoss, Limewire, Vuze, Eclipse, Apache ActiveMQ, MySQL server, and SQLite.
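
The idea of avoiding a previously encountered deadlock can be illustrated with a small single-threaded simulation. This is only a sketch under simplifying assumptions, not the Dimmunix implementation: the class `DeadlockImmunity`, the site labels `s1` to `s4`, and the choice to represent a deadlock signature as the set of code sites at which the involved locks were acquired are all hypothetical. Real threads, call stacks, and fairness are not modeled.

```python
class DeadlockImmunity:
    """Toy lock monitor: detects deadlocks, remembers them, avoids repeats."""

    def __init__(self, history=()):
        self.history = set(history)  # signatures of deadlocks seen so far
        self.owner = {}              # lock -> thread holding it
        self.acq_site = {}           # lock -> code site where it was acquired
        self.waiting = {}            # thread -> lock it is blocked on

    def request(self, thread, lock, site):
        """Return 'granted', 'blocked', 'deadlock', or 'yield'."""
        # Sites at which *other* threads acquired the locks they still hold.
        active = {self.acq_site[l] for l, t in self.owner.items() if t != thread}
        for sig in self.history:
            if site in sig and sig <= active | {site}:
                return "yield"  # granting would recreate a known pattern
        if lock not in self.owner:
            self.owner[lock] = thread
            self.acq_site[lock] = site
            self.waiting.pop(thread, None)
            return "granted"
        # Lock is taken: record the wait and look for a cycle through it.
        self.waiting[thread] = lock
        cycle = self._cycle_from(thread)
        if cycle:
            self.history.add(frozenset(self.acq_site[l] for l in cycle))
            return "deadlock"
        return "blocked"

    def _cycle_from(self, start):
        """Follow thread -> awaited lock -> owner; return the locks on a cycle."""
        locks, thread, seen = [], start, set()
        while thread in self.waiting and thread not in seen:
            seen.add(thread)
            lock = self.waiting[thread]
            locks.append(lock)
            thread = self.owner.get(lock)
            if thread == start:
                return locks
        return None

    def release(self, thread, lock):
        if self.owner.get(lock) == thread:
            del self.owner[lock]
            del self.acq_site[lock]


# First run: classic A/B lock inversion between two threads.
m = DeadlockImmunity()
first_run = [
    m.request("T1", "A", "s1"),
    m.request("T2", "B", "s2"),
    m.request("T1", "B", "s3"),
    m.request("T2", "A", "s4"),  # closes the cycle; signature is recorded
]

# After a restart, the saved history steers T2 away from the same pattern.
m2 = DeadlockImmunity(history=m.history)
second_run = [
    m2.request("T1", "A", "s1"),
    m2.request("T2", "B", "s2"),  # T2 is asked to yield instead of proceeding
    m2.request("T1", "B", "s3"),
]
m2.release("T1", "B")
m2.release("T1", "A")
second_run += [m2.request("T2", "B", "s2"), m2.request("T2", "A", "s4")]
```

In the second run the same program completes because T2 is briefly held back at `s2` while T1 still holds the lock it took at `s1`; this is the "alternate execution path" in miniature.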

We also implemented a collaborative version of Dimmunix, called Communix. Communix enables machines connected to the Internet to immunize each other against deadlocks: once one node encounters a deadlock, the other nodes are protected against it without having to encounter it themselves.
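
The collaborative scheme can be pictured as nodes exchanging deadlock signatures through a shared store. The following is a minimal sketch under our own assumptions, not the Communix protocol: `SignatureRepository`, `Node`, and their methods are hypothetical names, signatures are plain sets of code-site labels, and networking, signature validation, and trust between nodes are left out entirely.

```python
class SignatureRepository:
    """Hypothetical shared store to which nodes publish deadlock signatures."""

    def __init__(self):
        self._log = []  # append-only list of distinct signatures

    def publish(self, signature):
        if signature not in self._log:
            self._log.append(signature)

    def fetch_since(self, cursor):
        """Return signatures published after `cursor` and the new cursor."""
        return self._log[cursor:], len(self._log)


class Node:
    """A machine that keeps a local deadlock history and syncs it with peers."""

    def __init__(self, name, repo):
        self.name = name
        self.repo = repo
        self.history = set()  # signatures this node is immune to
        self._cursor = 0      # position in the repository log already seen

    def on_deadlock(self, signature):
        # A locally encountered deadlock is remembered and shared.
        self.history.add(signature)
        self.repo.publish(signature)

    def sync(self):
        # Pull signatures discovered elsewhere since the last sync.
        new, self._cursor = self.repo.fetch_since(self._cursor)
        self.history.update(new)

    def is_immune_to(self, signature):
        return signature in self.history


repo = SignatureRepository()
node_a, node_b = Node("a", repo), Node("b", repo)

sig = frozenset({"s1", "s2"})        # signature of a deadlock seen on node a
node_a.on_deadlock(sig)

before_sync = node_b.is_immune_to(sig)
node_b.sync()
after_sync = node_b.is_immune_to(sig)
```

Node `b` never experiences the deadlock itself; it becomes immune as soon as it pulls the signature that node `a` published.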

Dimmunix is available in open-source form for both Java and C/C++ from http://dslab.epfl.ch/proj/dimmunix.

Last update: 21.02.2013

Publications

Publication
Jula Horatiu, Tozun Pinar, Candea George (2011), Communix: A Collaborative Deadlock Immunity Framework, in Intl. Conference on Dependable Systems and Networks (DSN).
Jula Horatiu, Andrica Silviu, Candea George (2011), Efficiency Optimizations for Implementations of Deadlock Immunity, in Intl. Conference on Runtime Verification (RV).
Jula Horatiu, Rensch Thomas, Candea George (2011), Platform-wide Deadlock Immunity for Mobile Phones, in Workshop on Hot Topics in System Dependability (HotDep).
Andrica Silviu, Candea George (2011), WaRR: A Tool for High-Fidelity Web Application Record and Replay, in Intl. Conference on Dependable Systems and Networks (DSN).
Andrica Silviu, Jula Horatiu, Candea George (2010), iProve: A Scalable Approach to Consumer-Verifiable Software Guarantees, in Intl. Conference on Dependable Systems and Networks (DSN).
Andrica Silviu, Candea George (2009), PathScore-Relevance: A Metric for Improving Test Quality, in Workshop on Hot Topics in System Dependability (HotDep).
Jula Horatiu, Candea George (2008), A Scalable, Sound, Eventually-Complete Algorithm for Deadlock Immunity, in Intl. Conference on Runtime Verification (RV).
Jula Horatiu, Tralamazza Daniel, Zamfir Cristian, Candea George (2008), Deadlock Immunity: Enabling Systems To Defend Against Deadlocks, in Symposium on Operating Systems Design and Implementation (OSDI).

Scientific events

Active participation

Title | Date | Place
Intl. Conf. on Dependable Systems and Networks | 13.07.2011 | Hong Kong (China)
Intl. Conf. on Runtime Verification | 13.07.2011 | Berkeley, CA (USA)
Intl. Conf. on Dependable Systems and Networks | 13.07.2010 | Chicago, IL (USA)
Intl. Workshop on Hot Topics in System Dependability | 13.07.2009 | Lisbon (Portugal)
Intl. Workshop on Runtime Verification | 13.07.2008 | Budapest (Hungary)
Symposium on Operating Systems Design and Implementation | 13.07.2008 | San Diego, CA (USA)


Use-inspired outputs


Start-ups

Name Year
ConfErr open-source prototype 2011

Abstract

Software systems have many bugs. Software is getting larger and more complex, with many systems consisting of many MLOC: Windows, Oracle, Linux, to name just a few (MLOC is a common notation for million-lines-of-code and KLOC for thousand-lines-of-code; quoted figures generally do not include program comments). The rule of thumb in the software industry is that production code contains 5-10 bugs per KLOC, implying that these large systems have many thousands of bugs lurking within, that programmers find too difficult or expensive to eradicate.

Large systems are often highly concurrent, and writing correct concurrent systems is one of the most challenging endeavors of software development. With the advent of cheap, parallel hardware, such as multi-core CPUs, new software is written with more parallelism and existing systems achieve higher degrees of run-time concurrency, thus exercising more untested paths. Already some of the most insidious bugs are blamed on concurrency; we expect things to get worse.

Significant advances have been made in software development tools, but the rate at which they can reduce bugs/KLOC (i.e., bug density) is outpaced by the rate at which software size grows in KLOC (i.e., code volume). E.g., Linux code size more than doubled in the last 5 years and Windows quadrupled in less than 10 years. The net effect of these disparate rates of progress is more overall bugs. It has been proposed to accept the presence of bugs as a fact of life and devise mechanisms to recover from the bugs when they manifest. In this context, we proposed using microrebooting for fast recovery, and experimental evaluation showed a factor of 50x improvement in availability for application servers. While microreboot-based recovery provides a band-aid, it does not improve a system's ability to avoid future failures.

In this project we will remedy that, by enabling software systems to learn from past failures and get "stronger" every time they have to reboot. We propose a set of runtime techniques that, with every encounter of a new failure, progressively improve the system's ability to avoid those failures in the future--this is what we call DEVELOPING IMMUNITY AGAINST FAILURES. We will pursue three topics of study in this project:

* Enable programs to automatically develop immunity against failures that can be avoided with alternate execution paths. We have preliminary results toward achieving DEADLOCK IMMUNITY by inducing alternate thread interleavings that avoid previously-encountered deadlock patterns. An early prototype prevents the reoccurrence of deadlocks in large Java programs (over 350 KLOC) with less than 2% performance overhead. We chose deadlocks as a first target because deadlock bugs are among the hardest and most expensive to fix in practice, so immunity against them is most beneficial. Given that deadlock failures are well specified, we are working on automatically generating formal proofs of deadlock-immunity properties, with the ultimate goal of having "software with guarantees," that promises to eventually be deadlock-free.
* Enable immunity against failures caused by bugs that can be avoided by intelligently dropping or reordering the inputs that trigger those bugs. We have taken some early steps toward developing SOFTWARE FUSES that automatically filter inputs that have been noticed in the past to cause the system to fail. Fuses enable buggy programs to execute failure-free, because they filter out the "pathogens" that exercise the program's bugs. They also provide a means for quarantining suspect inputs until there is high confidence that they will not cause failures. In the simpler case of immunity against input-triggered crashes, we expect to formally prove the immunity property of systems protected by fuses.
* Enable immunity against failures resulting from resource shortages by employing MICROREJUVENATION. Resource-induced failures (e.g., out of memory, out of file descriptors) usually take the form of crashes or performance degradation; the analysis of resource utilization patterns and forecasting of resource availability enables us to predict imminent failures. We develop a technique, microrejuvenation, to surgically reclaim resources with minimal negative impact on the system as a whole.
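
The forecasting step behind microrejuvenation can be sketched as extrapolating a resource-usage trend to the point of exhaustion. The example below is an illustration under our own assumptions, not the technique developed in the project: it assumes usage grows roughly linearly, fits a least-squares line to recent samples, and uses the hypothetical helpers `time_to_exhaustion` and `should_rejuvenate`. How resources are actually reclaimed is not modeled.

```python
def time_to_exhaustion(samples, capacity):
    """Estimate the time left until usage reaches capacity.

    samples:  list of (time, usage) pairs, oldest first
    capacity: usage level at which the resource is exhausted
    Returns the estimated time remaining after the last sample,
    or None if usage is not growing (or there is too little data).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of usage as a function of time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_rejuvenate(samples, capacity, horizon):
    """True if exhaustion is predicted within `horizon` time units."""
    remaining = time_to_exhaustion(samples, capacity)
    return remaining is not None and remaining <= horizon


# Example: file descriptors in use, sampled every 10 seconds (made-up data).
fd_samples = [(0, 100), (10, 200), (20, 300), (30, 400)]
remaining = time_to_exhaustion(fd_samples, capacity=1000)
```

With these made-up samples the fitted trend is 10 descriptors per second, so the limit of 1000 would be reached 60 seconds after the last sample; a monitor could trigger reclamation once that estimate drops below a chosen horizon.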