generic programming; functional programming; Scala; big data frameworks; distributed programming; programming languages
Stucki Nicolas, Biboudis Aggelos, Odersky Martin (2018), A practical unification of multi-stage programming and macros, in the 17th ACM SIGPLAN International Conference on Generative Programming: Concepts & Experiences, Boston, MA, USA.
Stucki Nicolas, Giarrusso Paolo G., Odersky Martin (2018), Truly abstract interfaces for algebraic data types: the extractor typing problem, in the 9th ACM SIGPLAN International Symposium on Scala (SCALA 18), St. Louis, MO, USA.
Scala has become the programming language of choice for many of today’s most popular and innovative big data frameworks. Thanks to its combination of object-oriented and functional programming, strong static type system, and position as a JVM language, Scala is the implementation language of a new generation of big data frameworks used by hundreds of thousands of developers worldwide: Spark, Flink, Scalding, Summingbird, and Kafka, to name just a few.

There is a general trend toward increasing confluence of programming and database technologies. The benefits of a tight integration of big data frameworks and programming languages include refined tooling (e.g. using IDEs) and rich embeddings of data analytics in complex applications. However, the combination of programming and database systems has so far largely been built atop shaky foundations. Popular frameworks for big data analytics like Spark make heavy use of runtime reflection, unofficial APIs internal to the Scala compiler, and bytecode rewriting on generated code. This makes the interfaces between programming and databases poorly understood, hard to maintain, coupled to the internals of a single compiler, and therefore not future-proof.

To move forward, we need to put combinations of programming languages and databases on better foundations. In keeping with Scala’s philosophy of being a scalable language, we propose to research ways to better express and export fundamental programming abstractions that are used in the interfaces between databases and programming languages.

The proposed work is broken down into three orthogonal research areas.

The first research area is about projecting data. Data definitions might originate in the programming language and then need to be exported to the database, or they might originate as a database schema which needs to be imported and understood in the programming language. For the first direction we will investigate how generic programming abstractions can best be embedded in Scala.
We plan to adapt generic programming concepts originally developed in the Haskell context for algebraic data types to object hierarchies with case classes. In the opposite direction, we will investigate ways to evolve Scala’s structural record types so that they can adequately represent rows in a database schema. We also plan to investigate some version of type providers to import data frames and other database schemas as types into the programming language.

The second area is about projecting control. To get high performance, it is imperative to be able to reify queries as data that can be optimized and mapped to different backends. We have previously developed lightweight modular staging under an ERC advanced grant. We plan to apply what we have learned in that project to embed meta-programming techniques in Scala that are easy to use and hard to abuse. Spark has applied staging techniques in the Tungsten project, with very significant reported performance gains. Our work will help Tungsten, or the next framework like it, project a larger source language to query optimizers in an efficient way.

The third area is about distributed programming abstractions. Unlike traditional databases, big data frameworks are distributed, and distribution is also a key factor in related technologies such as stream processing. Writing distributed systems is currently very much a black art, and the difficulty is exacerbated by existing low-level distributed programming models, which expose primitives that do not compose well. Reactive stream processing uses monadic abstractions similar to collection and database queries to model event streams. We have shown that they are a promising foundation for composable distributed protocols. Another aspect of distributed big data systems is that it is often preferable from a performance standpoint to keep data stationary and instead send the operations that work on it.
We plan to integrate both reactive stream processing and function serialization in a library supporting big data frameworks and applications.

The aim of the proposed project is to develop and implement techniques in these three areas to improve the connection between programming and big data and to provide solid foundations for big data frameworks built in Scala.
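
As a small illustration of the second research area, the idea of reifying queries as data can be sketched in plain Scala. The sketch below is not taken from the project, from Spark, or from lightweight modular staging; every name in it (`Term`, `Pred`, `Query`, `QueryDemo`, `optimize`, `toSql`, `run`) is hypothetical. It only shows, under these assumptions, how a query represented as a case-class tree can be rewritten by an optimizer and then mapped to two different backends.

```scala
// Hypothetical mini query language; all names are illustrative assumptions.
sealed trait Term
final case class Col(name: String) extends Term
final case class Lit(value: Int) extends Term

sealed trait Pred
final case class Gt(left: Term, right: Term) extends Pred
final case class And(left: Pred, right: Pred) extends Pred

sealed trait Query
final case class Table(name: String) extends Query
final case class Filter(source: Query, predicate: Pred) extends Query

object QueryDemo {
  type Row = Map[String, Int]

  // Optimizer pass: fuse a filter over a filter into one conjunction.
  def optimize(q: Query): Query = q match {
    case t: Table => t
    case Filter(source, p) =>
      optimize(source) match {
        case Filter(inner, p0) => Filter(inner, And(p0, p))
        case other             => Filter(other, p)
      }
  }

  private def termToSql(t: Term): String = t match {
    case Col(n) => n
    case Lit(v) => v.toString
  }

  private def predToSql(p: Pred): String = p match {
    case Gt(l, r)  => s"${termToSql(l)} > ${termToSql(r)}"
    case And(l, r) => s"(${predToSql(l)}) AND (${predToSql(r)})"
  }

  // Backend 1: render the reified query as SQL text.
  def toSql(q: Query): String = q match {
    case Table(n)            => s"SELECT * FROM $n"
    case Filter(Table(n), p) => s"SELECT * FROM $n WHERE ${predToSql(p)}"
    case Filter(src, p)      => s"SELECT * FROM (${toSql(src)}) AS t WHERE ${predToSql(p)}"
  }

  private def value(t: Term, row: Row): Int = t match {
    case Col(n) => row(n)
    case Lit(v) => v
  }

  private def holds(p: Pred, row: Row): Boolean = p match {
    case Gt(l, r)  => value(l, row) > value(r, row)
    case And(l, r) => holds(l, row) && holds(r, row)
  }

  // Backend 2: interpret the same query over in-memory rows.
  def run(q: Query, tables: Map[String, List[Row]]): List[Row] = q match {
    case Table(n)       => tables(n)
    case Filter(src, p) => run(src, tables).filter(row => holds(p, row))
  }
}
```

For example, with `val q = Filter(Filter(Table("users"), Gt(Col("age"), Lit(18))), Gt(Col("score"), Lit(50)))`, the call `QueryDemo.toSql(QueryDemo.optimize(q))` yields `SELECT * FROM users WHERE (age > 18) AND (score > 50)`, and `QueryDemo.run` evaluates the same tree against a map of in-memory tables. A staged embedding of the kind the project targets would derive such a tree from ordinary Scala code rather than have the programmer build it by hand.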