Project


Melise - Machine Learning Assisted Software Development

English title: Melise - Machine Learning Assisted Software Development
Applicant: Gall Harald
Number: 204632
Funding scheme: Project funding
Research institution: Institut für Informatik Universität Zürich
Institution of higher education: University of Zurich - ZH
Main discipline: Information Technology
Start/End: 01.11.2021 - 31.10.2025
Approved amount: CHF 465'001.00

Keywords (5)

mining software repositories; software engineering; software evolution; machine learning; AI for SE

Lay Summary (German)

Lead
Machine learning (ML) offers enormous potential for software engineering: it can learn from the rich and extensive data on software activities, processes, and developers, and feed the learned relationships back into software development. In this project, we want to use this data to design ML models that can effectively support software developers. The goal is to devise mechanisms for continuously evolving these models so as to keep them up to date, accurate, and effective.
Lay summary
The use of machine learning (ML) to support software development has recently attracted considerable attention in both academia and industry. One of the major problems with ML models is that they are built once but afterwards rarely updated or re-trained. This is an important starting point for our project: exploiting the data continuously generated in software development, from bug reports, user reviews, and version control to source-code changes and tests. Software projects evolve rapidly, while their ML models cannot keep pace; these models therefore become not only inaccurate but obsolete.

The goal of our project is to investigate how these software-development data streams can be used to repeatedly re-train ML models, with developers in the loop. To this end, we draw on two reference problems: bug prediction and effort estimation. Our hypothesis is that software development and ML model development must go hand in hand. In this way, we aim to lay the foundations for ML-assisted software development.

Last update: 25.10.2021

Lay Summary (Italian)

Lead
Machine learning (ML) is rapidly gaining ground in many fields, including software engineering. In theory, models can learn from examples to simplify the work of software developers. The entire software development pipeline, which includes both processes and people, as well as the experience of users, produces a substantial amount of data. With this project, we aim to exploit this wealth of data to produce ML models that can effectively assist software developers. The goal is to devise methods for continuously evolving ML models, so as to keep them up to date and accurate for software development.
Lay summary
In recent years, the use of machine learning (ML) models to support software engineering tasks has attracted considerable interest from both academia and practitioners. Once deployed, however, these models become a static entity, rarely updated or re-trained. We believe this is a major missed opportunity: a vast amount of data is continuously generated during the development process, such as bug reports, user reviews, or repository events. Unfortunately, a static ML model cannot benefit from this wealth of information. In short, while a specific software project keeps evolving, the ML model that is supposed to support its development remains unchanged. As a consequence, the model loses accuracy and, slowly but inevitably, becomes obsolete.

The goal of our project is to investigate whether the data stream generated during software development can be successfully exploited to improve ML models. Furthermore, we plan to collect additional data, based on direct feedback from developers, to continuously improve the models. For example, whenever the model produces a warning that is subsequently confirmed by a developer, we expect to be able to use that information to re-train the model.

In this project, we will focus on the reference problems of bug prediction and effort estimation. We believe that software evolution must go hand in hand with the evolution of ML models, and that exploiting feedback is key to achieving this. As a result, we expect to establish the necessary foundations for ML-assisted development that learns from the context in which it is deployed.
Last update: 25.10.2021

Lay Summary (English)

Lead
Machine learning (ML) is rapidly gaining ground in many fields, including software engineering. In theory, models can learn from examples to ease software developers' activities. The entire software development pipeline, which includes both processes and people, as well as the experience of users, produces an impressive amount of data. With this project, we aim to leverage this rich data to produce machine learning models that can effectively assist software developers. The goal is to devise means of continuously evolving ML models so as to keep them up to date and accurate for software development.
Lay summary
The usage of Machine Learning (ML) models to support software engineering tasks has witnessed considerable interest from both academics and practitioners in recent years. Once deployed, however, ML models often become a static entity, rarely updated or re-trained. We believe this is a missed opportunity: a vast amount of data is continuously generated during the development process, such as bug reports, user reviews, or repository events, and a static ML model cannot take advantage of it. In other words, while a specific software project keeps evolving, the ML model supposed to support its development practices remains unchanged. As a consequence, the model itself loses accuracy, slowly but inevitably becoming obsolete.

The goal of our project is to investigate whether the data stream created during software development can be successfully exploited to re-train ML models and, consequently, to improve them. Furthermore, we plan to gather an additional active data stream, based on direct feedback from developers, to continuously improve ML models. For instance, every time a warning produced by the model is assessed by a developer, we can re-train the model and, for example, reward it when it generated correct warnings.
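The summary does not specify an implementation, but the two streams it describes could be folded into a model incrementally, for example with scikit-learn's online-learning API. The sketch below is a minimal illustration under that assumption; the function names (`on_new_commit`, `on_developer_feedback`) and the feature encoding are hypothetical, not the project's actual code.

```python
# Minimal sketch: incremental re-training from a passive stream of labelled
# development events and an active stream of developer feedback.
# All names here are hypothetical illustrations, not the project's code.
import numpy as np
from sklearn.linear_model import SGDClassifier

CLASSES = np.array([0, 1])   # 0 = clean change, 1 = defect-prone change
model = SGDClassifier()      # simple linear baseline trained online

def on_new_commit(features: np.ndarray, label: int) -> None:
    """Passive stream: a repository event whose outcome is known,
    e.g. a commit later linked to a bug fix."""
    model.partial_fit(features.reshape(1, -1), [label], classes=CLASSES)

def on_developer_feedback(features: np.ndarray, predicted: int,
                          confirmed: bool) -> None:
    """Active stream: a developer confirms or rejects a warning;
    the assessed outcome becomes a fresh training example."""
    true_label = predicted if confirmed else 1 - predicted
    model.partial_fit(features.reshape(1, -1), [true_label], classes=CLASSES)
```

Because `partial_fit` updates the model one example at a time, the model keeps evolving with the project instead of remaining the static entity the summary warns about.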

In this project, we will focus on the reference problems of bug prediction and effort estimation. We claim that software evolution and ML model evolution need to go hand in hand, and that feedback loops are key to that. As a result, we will devise the necessary foundations for ML-assisted software development that learns from its context and avoids the typical concept drift of ML.
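One simple way to make the concept-drift concern above concrete is to track the model's accuracy on the incoming stream and trigger re-training when recent performance falls below the long-run average. The monitor below is an illustrative sketch with arbitrary thresholds; the project does not prescribe a specific drift-detection technique.

```python
from collections import deque

class DriftMonitor:
    """Flags possible concept drift when accuracy over the most recent
    predictions drops well below the running average (illustrative only)."""

    def __init__(self, window: int = 100, tolerance: float = 0.10):
        self.recent = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.hits = 0
        self.total = 0
        self.tolerance = tolerance

    def record(self, correct: bool) -> bool:
        """Record one prediction outcome; return True if drift is suspected."""
        self.recent.append(1 if correct else 0)
        self.hits += int(correct)
        self.total += 1
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent evidence yet
        recent_acc = sum(self.recent) / len(self.recent)
        overall_acc = self.hits / self.total
        return recent_acc < overall_acc - self.tolerance
```

A caller would invoke `monitor.record(prediction == label)` for each assessed prediction and schedule a re-training pass whenever it returns True.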
Last update: 25.10.2021


Abstract

The usage of Machine Learning (ML) models to support software engineering tasks has witnessed considerable interest from both academics and practitioners in recent years. Building an ML model takes several steps, ranging from the elicitation of requirements, feature engineering, and model training to evaluation and deployment. This typical pipeline often contains feedback loops: for instance, unsatisfactory training results may loop back to the feature engineering phase. However, once deployed, ML models often become a static entity, in the sense that they are rarely updated or re-trained. We believe this is a huge missed opportunity: a vast amount of data is continuously generated during the development process, e.g., issue tracker posts, user reviews, or repository events, and a static ML model cannot gain any advantage from it. In other words, while a specific software project keeps evolving, the ML model supposed to support its development practices remains unchanged. As a consequence, the model itself loses accuracy, slowly but inevitably becoming obsolete.

The goal of our project is to investigate whether the data stream created in software development can be successfully exploited to re-train ML models and, consequently, to improve them. However, such a data stream is not the only piece of information we aim to use. While it constitutes passive information, we plan to gather an additional active stream provided by the developers. Indeed, we want to implement a user-based feedback loop mechanism with the goal of continuously improving ML models. Our goal is to use it in a reinforcement learning fashion: every time a warning produced by the model is assessed by a developer, we can re-train the model and, for instance, reward it when correct warnings were generated.

In this project, we will focus on bug prediction and effort estimation as reference problems. We claim that software evolution and ML model evolution need to go hand in hand, and that feedback loops are key to that. We plan to create a comprehensive benchmark dataset that can be used for selecting generalizable and effective defect prediction approaches. We will devise a feedback loop exploiting both active (user-based) and passive data streams, with the goal of continuously improving ML models. As a result, we will devise the necessary foundations for ML-assisted software development that learns from its context and avoids the typical concept drift of ML.
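The abstract frames the feedback loop in reinforcement-learning terms without fixing an algorithm. A minimal concrete reading of "reward the model for confirmed warnings" is an action-value update as in a two-armed bandit: the action is whether to surface a warning, and the developer's assessment supplies the reward. The sketch below is an assumption-laden toy, not the project's method.

```python
import random

class WarningPolicy:
    """Toy epsilon-greedy bandit over the decision to surface a warning,
    rewarded by developer assessments (illustrative assumption only)."""

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.value = {True: 0.0, False: 0.0}  # estimated reward per action
        self.count = {True: 0, False: 0}

    def decide(self) -> bool:
        """Choose whether to surface the next warning."""
        if random.random() < self.epsilon:    # explore occasionally
            return random.choice([True, False])
        return self.value[True] >= self.value[False]

    def feedback(self, surfaced: bool, confirmed: bool) -> None:
        """Reward confirmed warnings, penalize rejected ones; suppressed
        warnings yield no developer signal, hence zero reward."""
        reward = (1.0 if confirmed else -1.0) if surfaced else 0.0
        self.count[surfaced] += 1
        self.value[surfaced] += (reward - self.value[surfaced]) / self.count[surfaced]
```

In the project's setting, such a policy would sit on top of the underlying prediction model, while the confirmed or rejected warnings also feed the re-training stream described earlier.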