Programming Language Abstractions for Big Data

English title: Programming Language Abstractions for Big Data
Applicant: Odersky Martin
Number: 167213
Funding scheme: NRP 75 Big Data
Research institution: Laboratoire de méthodes de programmation 1, EPFL - IC - IIF - LAMP1
Institution of higher education: EPF Lausanne - EPFL
Main discipline: Information Technology
Start/End: 01.03.2017 - 28.02.2021
Approved amount: 599'928.00

Keywords (6)

generic programming; functional programming; Scala; big data frameworks; distributed programming; programming languages

Lay Summary (German)

Lead
Scala is one of the leading programming languages for data science platforms and tools. This project develops new concepts to make the programming languages used in this area easier to understand and more user-friendly.
Lay summary

The project consists of several parts. The first work package deals with the elementary data structures that provide access to a database. A major obstacle has to be overcome here: while data structures in programming languages typically consist of just a few fields, database records can comprise several hundred columns. We want to solve this problem by extending the programming language so that data structures can be defined more flexibly. The second package is devoted to optimisation: how do you generate efficient code for typical big data workloads? Both work packages feed into an application that demonstrates our concept for distributed data processing.

The Scala programming language has been developed at EPFL since 2003. Its many favourable properties have made Scala the implementation language of a new generation of big data software libraries that are now used by several hundred thousand developers worldwide. Among the better-known frameworks written in Scala are Spark, Flink, Scalding, Summingbird and Kafka. Scala is also popular as a query and programming language for working with these frameworks.

We want to improve the interaction between programming languages and databases. This does not mean integrating specific database features into a programming language (which would not be feasible anyway). Rather, in keeping with Scala's philosophy of being a general-purpose language, we want to find out how the fundamental abstractions (ways of formulating the essential tasks) used at the interfaces between databases and programming languages can be better expressed and exported.

Last update: 26.07.2017

Lay Summary (French)

Lead
Scala is one of the leading programming languages for data science platforms and tools. In this project, we will work on new programming language concepts to improve the clarity and ease of use of the language.
Lay summary

The project is divided into several parts. The first deals with the fundamental data structures needed for database access. Here we have to bridge a gap in scale: data structures in programming languages generally have only a few fields, whereas database records can have several hundred columns. We will try to solve the problem by extending the programming language so that more flexible data structures can be defined. Another part of the project focuses on optimisation: how can we generate efficient code for typical big data workloads? All work packages will feed into an application illustrating our approach to distributed data processing.

EPFL has been developing the Scala programming language since 2003. Thanks to a series of favourable characteristics, Scala is the implementation language of a new generation of big data frameworks (software libraries) used by hundreds of thousands of developers around the world. Spark, Flink, Scalding, Summingbird and Kafka are among the most popular frameworks written in Scala. Scala is also a query and programming language for working with these frameworks.

We want to improve the combination of programming languages and databases. The goal is not to integrate specific database features into a programming language, as that would be infeasible. Rather, in keeping with Scala's philosophy of being a versatile language, we are looking for ways to better express and export the fundamental programming abstractions (ways of formulating essential tasks) used at the interfaces between databases and programming languages.


Last update: 26.07.2017

Lay Summary (English)

Lead
Scala is one of the leading languages for data science platforms and tools. In this project, we will work on new programming language concepts to improve the clarity and ease of use of the language in this domain.
Lay summary

The project consists of several parts. One part deals with the fundamental data structures needed for database access. Here we have to bridge a gap in scale: data structures in programming languages typically have only a few fields, whereas database records can have many hundreds of columns. We will try to solve the problem by extending the programming language so that more flexible data structures can be defined. Another part of the work deals with optimisation: how can we generate efficient code for typical Big Data workloads? All work packages will flow into an application that demonstrates our approach to distributed data processing.

The Scala programming language has been under development at EPFL since 2003. Thanks to a variety of favourable attributes, Scala is the implementation language of a new generation of Big Data “frameworks” (software libraries) used by hundreds of thousands of developers worldwide. Spark, Flink, Scalding, Summingbird and Kafka are the names of some of the more popular frameworks written in Scala. Scala is also a popular query and programming language for working with these frameworks.

We want to improve combinations of programming languages and databases. The aim is not to integrate specific database features in a programming language, which would be infeasible anyway. Instead, following Scala’s philosophy of being a versatile language, we want to research ways to better express and export fundamental programming abstractions (ways of formulating essential tasks) that are used in the interfaces between databases and programming languages.


Last update: 26.07.2017

Responsible applicant and co-applicants

Employees

Publications

Publication
A practical unification of multi-stage programming and macros
Stucki Nicolas, Biboudis Aggelos, Odersky Martin (2018), A practical unification of multi-stage programming and macros, in the 17th ACM SIGPLAN International Conference on Generative Programming: Concepts & Experiences (GPCE 2018), Boston, MA, USA.
Truly abstract interfaces for algebraic data types: the extractor typing problem
Stucki Nicolas, Giarrusso Paolo G., Odersky Martin (2018), Truly abstract interfaces for algebraic data types: the extractor typing problem, in the 9th ACM SIGPLAN International Symposium on Scala (SCALA 2018), St. Louis, MO, USA.

Collaboration

Group / person: IBM Spark Technology Center, United States of America (North America)
Types of collaboration:
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Research Infrastructure
- Industry/business/other use-inspired collaboration

Scientific events



Self-organised

Title: No.136 Functional Stream Libraries and Fusion: What's Next?
Date: 21.10.2018
Place: Shonan Village, Japan

Communication with the public

- Talks/events/exhibitions: Future-proofing Scala: the TASTY intermediate representation (International, 2019)
- Talks/events/exhibitions: Metaprogramming in Dotty (International, 2019)

Abstract

Scala has become the programming language of choice for many of today's most popular and innovative big data frameworks. Thanks to its combination of object-oriented and functional programming, its strong static type system, and its position as a JVM language, Scala is the implementation language of a new generation of big data frameworks used by hundreds of thousands of developers worldwide: Spark, Flink, Scalding, Summingbird, and Kafka, to name just a few.

There is a general trend towards increasing confluence of programming and database technologies. The benefits of a tight integration of big data frameworks and programming languages include refined tooling (e.g. in IDEs) and rich embeddings of data analytics in complex applications. However, combinations of programming and database systems have so far largely been built atop shaky foundations. Popular frameworks for big data analytics such as Spark make heavy use of runtime reflection, of unofficial APIs internal to the Scala compiler, and of bytecode rewriting on generated code. This makes the interfaces between programming languages and databases poorly understood, hard to maintain, coupled to the internals of a single compiler, and therefore not future-proof.

To move forward, we need to put combinations of programming languages and databases on better foundations. In keeping with Scala's philosophy of being a scalable language, we propose to research ways to better express and export the fundamental programming abstractions that are used in the interfaces between databases and programming languages. The proposed work is broken down into three orthogonal research areas.

The first research area is about projecting data. Data definitions might originate in the programming language and then need to be exported to the database, or they might originate as a database schema that needs to be imported and understood in the programming language. For the first direction, we will investigate how generic programming abstractions can best be embedded in Scala; we plan to adapt generic programming concepts originally developed in the Haskell context for algebraic data types to object hierarchies with case classes. In the opposite direction, we will investigate ways to evolve Scala's structural record types so that they can adequately represent rows in a database schema. We also plan to investigate some version of type providers to import data frames and other database schemas as types into the programming language.
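To illustrate the second direction, here is a minimal sketch of how wide database rows could be given statically typed fields without declaring a case class per table. It assumes Scala 3's programmatic structural types via the Selectable trait; the row type and column names are hypothetical and only stand in for a generated or imported schema.

```scala
// A record backed by a plain Map, refined with a structural type so that
// individual columns can be accessed with static types. A real schema
// importer would generate the refinement from database metadata.
class Row(fields: Map[String, Any]) extends Selectable:
  def selectDynamic(name: String): Any = fields(name)

// Hypothetical table with (possibly hundreds of) columns; only two are shown.
type Customer = Row { val customer_id: Long; val city: String }

@main def rowDemo(): Unit =
  val c = Row(Map("customer_id" -> 42L, "city" -> "Lausanne")).asInstanceOf[Customer]
  println(c.customer_id) // statically typed as Long
  println(c.city)        // statically typed as String
  // c.ctiy              // would not compile: no such field in the refinement
```

The point of the sketch is that the refinement type carries the schema, so column access is checked at compile time even though the underlying representation stays as generic as a map.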
The second area is about projecting control. To obtain high performance, it is imperative to be able to reify queries as data that can be optimized and mapped to different backends. We have previously developed lightweight modular staging under an ERC Advanced Grant. We plan to apply what we have learned in that project to embed meta-programming techniques in Scala that are easy to use and hard to abuse. Spark has applied staging techniques in its Tungsten project, with very significant reported performance gains. Our work will help Tungsten, or the next framework like it, project a larger source language onto query optimizers in an efficient way.
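The following is a minimal sketch of the staging style this area builds on, using the quote ('{...}) and splice (${...}) operators of Scala 3 macros, close to the standard power example from the Scala 3 documentation. It illustrates the technique rather than the project's deliverable; in real code the macro implementation must live in a separate compilation unit from its call sites.

```scala
import scala.quoted.*

// Staged exponentiation: when the exponent is a compile-time constant,
// power(x, 3) is unrolled into straight-line multiplications; otherwise
// it falls back to a runtime call to math.pow.
inline def power(x: Double, inline n: Int): Double =
  ${ powerCode('x, 'n) }

def powerCode(x: Expr[Double], n: Expr[Int])(using Quotes): Expr[Double] =
  n.value match
    case Some(m) if m >= 0 => unrolled(x, m)                 // exponent known: generate specialized code
    case _                 => '{ math.pow($x, $n.toDouble) } // exponent unknown: stay at runtime

def unrolled(x: Expr[Double], n: Int)(using Quotes): Expr[Double] =
  if n == 0 then '{ 1.0 }
  else '{ $x * ${ unrolled(x, n - 1) } }
```

A call such as power(x, 3) then elaborates at compile time to x * (x * (x * 1.0)), while power(x, k) for a runtime-known k compiles to a plain math.pow call; a reified query would be handled analogously, with the generated code targeting a chosen backend.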
The third area is about distributed programming abstractions. Unlike traditional databases, big data frameworks are distributed, and distribution is also a key factor in related technologies such as stream processing. Writing distributed systems is currently very much a black art, a problem exacerbated by the fact that existing low-level distributed programming models expose primitives that do not compose well. Reactive stream processing uses monadic abstractions similar to collection and database queries to model event streams, and we have shown that they are a promising foundation for composable distributed protocols [9]. Another aspect of distributed big data systems is that it is often preferable, from a performance standpoint, to keep data stationary and instead send the operations that act on the data. We plan to integrate both reactive stream processing and function serialization in a library supporting big data frameworks and applications.

The aim of the proposed project is to develop and implement techniques in these three areas to improve the connection between programming and big data, and to provide solid foundations for big data frameworks built in Scala.
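As a closing illustration of the third area, here is a minimal, self-contained sketch of the idea behind composable, reified distributed operations: a stream pipeline is described as plain data, so it can be inspected, optimized, or serialized and shipped to the node that holds the data. All names are hypothetical and stand in for, rather than reproduce, the library proposed in the project; a real implementation would also require the captured functions to be serializable (e.g. spores-style closures).

```scala
// Stream transformations reified as data rather than executed in place.
sealed trait DStream[A]:
  def map[B](f: A => B): DStream[B]       = MapNode(this, f)
  def filter(p: A => Boolean): DStream[A] = FilterNode(this, p)

final case class SourceNode[A](label: String)                    extends DStream[A]
final case class MapNode[A, B](src: DStream[A], f: A => B)       extends DStream[B]
final case class FilterNode[A](src: DStream[A], p: A => Boolean) extends DStream[A]

@main def pipelineDemo(): Unit =
  // The pipeline is built with familiar collection-style combinators, but
  // nothing runs yet: the value below is a description that a framework
  // could optimize and send to a remote worker holding the data.
  val pipeline: DStream[Int] =
    SourceNode[String]("sensor-events")
      .map(_.length)
      .filter(_ > 10)
  println(pipeline)
```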