Management Summary
This article presents five technologies that every data engineer should know and master in their daily work: Spark as a data processing tool in the big data environment, Kafka as a streaming platform, and Airflow as well as serverless architecture for coordination and orchestration. Before that, the importance and role of SQL (Structured Query Language) and relational databases are discussed.
Despite the constant change, SQL has held a special position and appears as an interface in many new developments. However, it is by no means the case that, as decades ago, solid knowledge of this query language is sufficient to handle the majority of data work. Both the data landscape and the data being processed have become too heterogeneous. In many cases, it is no longer sufficient to update data on a fixed schedule, or the data volumes to be processed can simply no longer be shouldered by classic ETL (Extract-Transform-Load) processes in databases.
Data streaming platforms such as Apache Kafka are an answer to the real-time problem, able to scale with requirements and provide fail-safe operation. When it comes to processing enormous amounts of data, Spark is the tool of choice. Just like Kafka, it scales with the requirements. Furthermore, it offers a generous selection of implementation languages.
The heterogeneity of the data stores and processing frameworks in use can be addressed with two approaches: a central (global) orchestration that serves the various components, as Airflow does impressively, or a decentralized (local) solution that only reacts to specific signals from individual components, as the serverless approach does. All current cloud providers offer such a service. In summary, you will find here a mixture of tools to answer the current questions in data engineering.
Intro
The data landscape has been characterized by increasing dynamics over the past decades. Until the turn of the millennium, data warehouses and relational database management systems (RDBMS) were still considered the gold standard for data storage and preparation. The spread and dynamics of the Internet, in particular, broke this unique position: on the one hand, the amount of data multiplied; on the other hand, interest increasingly focused on semi-structured and unstructured data. We had arrived in the Big Data age. The next push came from the data flows generated by mobile devices and sensors (keyword: Internet of Things). It was no longer just a matter of coping with the enormous increase in data input, but of recognizing events from many data points in real time and reacting to them. Finally, cloud computing has unlocked additional potential for data analysis and processing, as infrastructure is now available at low cost in many respects and various technologies can be used with a low entry threshold. The initially monolithic world of databases has evolved into a heterogeneous and dynamic data landscape whose requirements data engineers have to meet.
This article aims to provide orientation for solving current problems in “Data Engineering”* with the help of five technologies. In the areas of “Batch Processing” and “Streaming”, the established technologies Apache Spark and Apache Kafka are presented. With Apache Airflow and the serverless technology Lambda, two different concepts for controlling processes are introduced. Finally, contrary to the introduction’s thrust, there is also a chapter on SQL and relational databases.
* Data Engineering is used here in a narrow sense, namely as the activity of establishing a continuous data supply and preparation.
SQL and Relational Database Systems
SQL stands for Structured Query Language and is an integral part of relational database systems. Contrary to the general tone and despite various (further) developments of NoSQL [1] solutions, SQL systems continue to play a major role in modern data architectures. This is also shown by a survey from 2019, which can be summarized as follows: in most cases, the use of NoSQL systems is not a turning away from, but an addition to, existing systems that rely on SQL.
SQL’s popularity is also reflected in modern frameworks: the developments in Spark and Kafka (see the following sections), for example, bear witness to the importance given to SQL. Finally, there are tools that make NoSQL systems accessible via SQL queries, Apache Drill being one example.
One reason for SQL’s popularity, apart from its simplicity and semantic closeness to the English language, is its wide distribution. Even outside the database milieu, there are analysts and employees in reporting who have mastered SQL. The following query illustrates this:
SELECT
DEPARTMENT,
MANAGER,
COUNT(USER_ID)
FROM TBL_EMPLOYEE
WHERE IS_EXTERNAL = 1
GROUP BY DEPARTMENT, MANAGER
Clearly readable, this query carries out a selection, filtering, and grouping of data.
Relational databases are closely related to SQL. This mature technology stands out for its consistency and durability. In addition, the table schema must be defined before the first write, which provides predictability and safety but can also be perceived as rigid and complex to manage. Handling data whose structure cannot be specified explicitly, or that changes over time, can therefore be difficult in relational databases. The same is true for data with complex structure, because such data must also be fitted into tables and relations before the database can handle it effectively.
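As a minimal illustration of this schema-first principle, the following sketch uses Python’s built-in sqlite3 module; the table mirrors the hypothetical TBL_EMPLOYEE from the query above.
import sqlite3

# In-memory database, purely for illustration
conn = sqlite3.connect(":memory:")

# The schema has to be declared before the first row can be written
conn.execute(
    """CREATE TABLE TBL_EMPLOYEE (
           USER_ID     INTEGER PRIMARY KEY,
           DEPARTMENT  TEXT NOT NULL,
           MANAGER     TEXT NOT NULL,
           IS_EXTERNAL INTEGER NOT NULL
       )"""
)

# Rows must fit the declared columns
conn.execute(
    "INSERT INTO TBL_EMPLOYEE (USER_ID, DEPARTMENT, MANAGER, IS_EXTERNAL) VALUES (?, ?, ?, ?)",
    (1, "Finance", "Smith", 1),
)
conn.commit()

# The query from above now runs against the defined schema
for row in conn.execute(
    "SELECT DEPARTMENT, MANAGER, COUNT(USER_ID) "
    "FROM TBL_EMPLOYEE WHERE IS_EXTERNAL = 1 "
    "GROUP BY DEPARTMENT, MANAGER"
):
    print(row)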
Advantages of relational databases:
- A mature technology developed since the 1970s and therefore mastered by many experts
- Strong typing and the obligation to define table schemas up front guarantee predictability and safety
- With SQL, a widespread and understandable query language
- The implementation of the ACID properties, which guarantee consistency, safety, and durability of data states
- Low-redundancy storage through normalization and referencing
Spark
Almost a classic and already matured to version 3, Spark is the standard for efficiently processing very large data volumes. Its performance rests mainly on two pillars: on the one hand, processing steps are distributed to a group of worker nodes, making it possible to process large data volumes in parallel; on the other hand, an intelligent mechanism keeps intermediate results in memory, shortening computation paths and reducing access times.
Processing by the Spark cluster is only triggered once the route is defined end-to-end (from the initial data via transformations to the final product). As indicated, Spark tries to execute the workload on the worker nodes with as much parallelism as possible. Similar to the query optimizer in a relational database, Spark looks for a high-performance way to split the definition into individual steps and distribute these steps to the workers as tasks.
A simple example illustrates this scheme. Assume that data from a large number of files in text format is to be processed, and various transformations are performed to extract important information. From a second data source, which is also raw text, further information is extracted and joined to this intermediate product.
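A sketch of such a pipeline in PySpark could look as follows; the file paths, the separator, and the column names are hypothetical and merely stand in for the transformations described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("text-enrichment").getOrCreate()

# First source: raw text files, split into columns and de-duplicated
events = (
    spark.read.text("data/events/*.txt")          # hypothetical path
    .select(split(col("value"), ";").alias("parts"))
    .select(
        col("parts")[0].alias("user_id"),
        col("parts")[1].alias("event_type"),
    )
    .distinct()
)

# Second source: also raw text, reduced to a lookup of user attributes
users = (
    spark.read.text("data/users/*.txt")           # hypothetical path
    .select(split(col("value"), ";").alias("parts"))
    .select(
        col("parts")[0].alias("user_id"),
        col("parts")[1].alias("department"),
    )
    .distinct()
)

# Nothing has been executed so far; the join plus the write below
# define the route end-to-end and trigger the actual processing
enriched = events.join(users, on="user_id", how="left")
enriched.write.mode("overwrite").parquet("data/enriched/")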
The figure below shows the execution plan that Spark generates internally for this job:
The initial data, available as text files, is read in at the beginning of Stages 8 and 10 and then transformed, which is reflected in the operations “map” and “distinct”. Finally, the results are combined with a “join” operation (as known from SQL) in Stage 10.
Another advantage of Spark is its compatibility with various data formats. Spark allows read and write operations in JSON, XML, CSV, and Avro, among others, but can also handle database connections – be it classic RDBMS or big data stores like Hive.
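A few hedged examples of these interfaces (the paths, the JDBC connection details, and the PostgreSQL driver are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON and CSV are supported out of the box
df_json = spark.read.json("data/raw/events.json")
df_csv = spark.read.option("header", True).csv("data/raw/events.csv")

# Classic RDBMS via JDBC (URL, table, and credentials are placeholders)
df_db = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
    .option("dbtable", "public.tbl_employee")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Writing back, for example as Parquet
df_json.write.mode("overwrite").parquet("data/curated/events/")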
Although the engine itself runs on the JVM, Python (in the form of the PySpark library) has long since become the most widely used implementation language. Developments such as Koalas, which integrates the popular data analysis library Pandas into Spark, further underline this. Other supported languages are R, Scala, Java, and SQL.
Spark’s primary strength is the processing of defined processing jobs, in all facets and colors. For use cases that expect low latencies on data requests, other tools such as Presto, Impala, or Dremio are the better choice.
Advantages of Spark:
- Processes large amounts of data (TB range) very efficiently
- Scales by adding more worker nodes to the cluster
- Supports many data formats and connections to databases
- Processing queries can be written in SQL, R, Scala, Java, and Python
Spark is available in all possible editions and sizes:
- Jupyter notebooks with PySpark and local Spark context (for example, as docker container)
- On-premise solutions like Cloudera or Hortonworks
- Notebook-centered cloud solution from Databricks
- Cloud clusters: EMR in AWS, HDInsight in Azure, Dataproc in GCP
Kafka
Kafka is a distributed, fault-tolerant, and high-performance streaming platform. It often comes into play when high data volumes need to be processed in real time. The platform is also used in scenarios where a large number of heterogeneous systems with mutual dependencies are involved. Unlike ordinary data stores, a stream is by definition never exhausted, and operations on it are equally continuous.
Messages are treated as key-value pairs and organized into topics. Topics, in turn, are divided into partitions, which are kept redundantly on several nodes and are thus secured against failure. Finally, there are participants at both ends: consumers and producers, who read from and write to the topics.
The figure shows the place Kafka occupies in the system, namely as a data hub: from below, the source systems deliver into the Kafka system (producers), and from above, the (thus processed) streams are consumed (consumers).
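A minimal sketch of both roles in Python, assuming the confluent-kafka client, a local broker, and a hypothetical topic “clickstream”:
from confluent_kafka import Producer, Consumer

# Producer: a source system writes key-value messages into a topic
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("clickstream", key="user-42", value='{"page": "/home"}')
producer.flush()

# Consumer: a downstream system reads the same topic continuously
consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "analytics",
        "auto.offset.reset": "earliest",
    }
)
consumer.subscribe(["clickstream"])

# A stream is never exhausted, so consumption is an open-ended loop
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())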
Through the Streams API and the development of ksqlDB, which allows SQL queries to be executed directly on streams, a high-level interface is available to interact with streams and to build new ones from existing ones. All the transformations known from other processing frameworks (like Spark) or database SQL dialects are represented: filtering, grouping, joining, mapping, etc. In addition, there is the timeline inherent to streams, which can be handled with windowing functions, i.e., questions about how many events have taken place per time period, e.g., clicks on elements of web pages.
The advantages of Kafka:
- Enables analyses and operations in real-time
- Integrates heterogeneous systems into a central data hub (Producer and Consumer API)
- Scalable and fail-safe due to redundant keeping of partitions
- Provides ksqlDB, a technology to transform streams with SQL queries
A good starter kit can be found here. There are also managed services on all current cloud platforms (Amazon MSK, Azure HDInsight, …).
Airflow
A further topic is structuring the sequence of the individual data preparation and processing steps. Especially with a high number of components involved and great heterogeneity among them, it is highly advisable to use an orchestration tool to guarantee a stable process. For modern orchestration tools, there is a broad catalog of requirements, which can be divided into the following categories:
- Controllability: The components and steps involved are well defined. The execution should be comprehensible, both while it is running and in retrospect.
- Integrability: Various and usually very heterogeneous components must be addressable, and their execution should be observable.
- Reactivity: The results of individual steps are to be evaluated and fed into the further course of the process. For example, it has to be defined what happens if a step fails: a retry, a deviation from the normal process path, or ignoring the failure?
- Scalability: The growth of the process structure can be compensated as easily as possible by extending the execution infrastructure.
Apache Airflow proves to be an optimal candidate for these requirements. Complex processes are described as directed acyclic graphs (DAGs); the steps are linked together as nodes by conditions. These DAGs form an executable unit, and their execution can be automated. The description is written purely in Python, which has the advantage that development is fast and custom operators can be written easily. Most of the time, this is not even necessary, because Apache Airflow already contains many common operators, among them operators for triggering Spark jobs, provisioning cloud infrastructure, and executing database queries.
The example below shows an exemplary DAG in which several strands are executed in parallel and finally merged.
The figure shows how DAGs can branch into several directions. The last step evaluates the results from both predecessors and merges the strands again.
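A minimal DAG along these lines could look like the following sketch; the task names and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_source_a():
    ...  # hypothetical extraction from the first source


def extract_source_b():
    ...  # hypothetical extraction from the second source


def merge_results():
    ...  # evaluates both predecessors and merges the strands


with DAG(
    dag_id="parallel_extract_and_merge",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_a = PythonOperator(task_id="extract_source_a", python_callable=extract_source_a)
    task_b = PythonOperator(task_id="extract_source_b", python_callable=extract_source_b)
    merge = PythonOperator(task_id="merge_results", python_callable=merge_results)

    # Both extraction strands run in parallel and are merged at the end
    [task_a, task_b] >> merge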
When it comes to scalability, Airflow provides two solutions: the Celery Executor distributes tasks via a message broker to the worker processes registered there, and Kubernetes can be connected as an execution platform via the Kubernetes Executor.
Finally, the management of access to external systems (be it databases or cloud computing instances) should be mentioned. For this purpose, Airflow offers secrets management, which stores the connection credentials to these systems in a clear, central, and encrypted way.
The advantages of Airflow:
- Complex processes are described and controlled in DAGs
- Airflow offers a cornucopia of operators for communication with external systems
- The execution can be scaled using the Celery and Kubernetes Executors
- Secrets from databases and other systems are centrally managed by Airflow and can be referenced in code
There are ready-to-use Docker images available to try out Airflow, but installation and setup are also quite simple. In addition, Google Cloud offers the managed Airflow service Cloud Composer, with which you can start immediately.
Serverless
Finally, a completely different approach to flow control has to be mentioned here, one that belongs to event-driven architecture. Instead of a central control unit that orchestrates the processes, it is defined locally which processes react to specific events. Such an event could be a change in the data lake, an infrastructure event such as the provisioning of a computing instance, or simply a time event. On the function side, typical examples are triggering a data ingestion, executing a table update, or sending a notification. Through the local configuration of this reactivity, a process complex is assembled.
The figure shows the serverless architecture schematically, taken from OpenWhisk, an open-source serverless platform. A feed provides the event flow; a rule connects the occurrence of an event (trigger) with an action.
Another special feature is the serverless paradigm itself: there is no need to provision computing resources. Instead, the underlying environment (e.g., a cloud infrastructure) decides how resources are allocated and where code is executed. Kubernetes is one representative of such a platform for managing and deploying resources. Developers thus enjoy the convenience of focusing entirely on development. Furthermore, execution is language-agnostic, and all common modern programming languages are supported. Executions in different languages can therefore coexist and can also be coupled with each other.
Some examples for the application:
- A new file has been stored in the ADL (Azure Data Lake) and data ingestion should be triggered (see the sketch after this list).
- The content of a table in the NoSQL database DynamoDB has changed, and the change should be written to an archive (this corresponds to triggers in databases).
- Time events: Every hour, data should be provided for model training (comparable to cron jobs in the server-centered world).
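As a hedged sketch of the first example, transferred to AWS, here is what a Lambda handler reacting to a new object in an S3 bucket (the AWS counterpart of the ADL scenario) could look like; the event wiring and the ingestion step are assumptions.
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by an S3 'object created' event; starts a hypothetical ingestion."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly arrived object ...
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()

        # ... and hand it over to the ingestion step (placeholder)
        print(f"Ingesting {len(payload)} bytes from s3://{bucket}/{key}")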
The advantages of serverless architecture:
- No expenses for infrastructure provision; instead, the focus is on the implementation of the functionality
- Event-driven instead of central orchestration
- Functions can be written in different programming languages
- Functions of different languages can coexist
The providers are usually found in the cloud: Lambda in AWS, Cloud Functions in GCP, Azure Functions in the Azure Cloud – without wanting to conceal the open source project OpenWhisk.
Conclusion
Despite the changes and breaks, SQL continues to be the central interface to data, both in terms of queries and processing. SQL may even outlive the world of relational databases from which it originally came.
On the other hand, it must also be stated that mastering SQL alone is no longer sufficient to meet modern problems, as it was in the former world of data warehouses. Anyone who leaves the standard functionality that SQL provides in Spark or Kafka in order to implement specific functions (so-called user-defined functions) needs additional knowledge of programming languages such as Python or Java. Similarly, as shown in the previous sections, working with Airflow or serverless technology requires mastery of the Python programming language – or, in the case of serverless, any other modern language.
Choosing five technologies means leaving out others, which should nevertheless be mentioned here:
- MongoDB as a representative of document databases (NoSQL), which can shine with flexibility and scalability
- Apache Cassandra as a representative of key-value databases (NoSQL), which above all stands for high scalability
- Apache Flink as data streaming framework in addition to Apache Kafka
- Apache Beam as an abstraction layer for data pipeline definition, which can then be executed in Apache Flink or Apache Spark
[1] NoSQL first and foremost means a departure from the SQL paradigm, and such databases do not form a homogeneous class. Instead, one can roughly define three subclasses: graph databases, document databases, and key-value databases.