Stream Processing for Everyone with Continuous SQL Queries
About four years ago, we started to add SQL support to Apache Flink with the primary goal to make stream processing technology accessible to non-developers. An important design decision to achieve this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is obviously not the only system providing a SQL interface to process streaming data. There are several commercial and open source systems offering similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ quite a lot.
In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper discussing their joint approach to streaming SQL.
We submitted the paper "One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD - the world's no. 1 database research conference - and it got accepted. Our goal was to get our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL Standards committee is discussing an extension of the standard to pinpoint the syntax and semantics of streaming SQL queries.
In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal that we discussed in our paper consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.