Is this feature a revolution in Spark?
Process data in Spark by making an API request instead of submitting an application. Everything you need to know about Spark Connect.
I publish a paid article every Tuesday, written with one goal in mind: to offer my readers, whether they are just beginning the journey and feeling overwhelmed or seeking a deeper understanding of the field, 15 minutes of practical lessons and insights on nearly everything related to data engineering.
Intro
Apache Spark has introduced a new approach for developing data processing applications, promising reduced operational overhead compared to traditional methods.
It is Spark Connect.
I heard about this feature a few months ago and looked at it more closely while writing an article about PySpark. However, I did not fully understand it at the time, so that article only included a few highlights.
This time, I’m confident in providing you with more details about this feature after spending a considerable amount of time playing with it and reading the code from the Spark GitHub repo.
In this article, I aim to explain Spark Connect as straightforwardly as possible. The resources I found on the Internet (even the official Spark documentation) could not give me all the answers I wanted while researching Spark Connect, so I hope my work here saves time for others learning about this feature.
First, let’s revisit some Spark fundamentals.
Spark cluster vs Resource cluster
A Spark application consists of:
Driver: This JVM process manages the entire Spark application, from running the user code and planning the execution to distributing tasks to the executors.
Executors: These processes execute tasks the driver assigns and report their status and results. Each Spark application has its own set of executors.
I will refer to the Driver-Executors as the Spark cluster. Each Spark cluster is associated with a Spark application.
There is another cluster that is responsible for providing the physical resources the Spark cluster runs on. I will refer to this as the resource cluster. It is a set of physical servers managed by a cluster manager; Spark can work with several cluster managers, such as YARN, Mesos, or Kubernetes. The driver communicates with the cluster manager to allocate resources to the executors.
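As a rough sketch (the master URL and resource values below are placeholders, assuming a standalone cluster manager), this is how an application typically tells the cluster manager what executor resources it needs:

```python
from pyspark.sql import SparkSession

# The driver registers with the cluster manager behind this master URL and asks it
# to launch executors with the requested CPU and memory on the resource cluster.
spark = (
    SparkSession.builder
    .appName("resource-cluster-demo")
    .master("spark://spark-master:7077")    # standalone cluster manager; could be YARN or Kubernetes
    .config("spark.executor.cores", "2")    # CPU per executor
    .config("spark.executor.memory", "1g")  # memory per executor
    .getOrCreate()
)
```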
Cluster Mode vs. Client Mode vs. Local Mode
Spark has different modes of execution, which are distinguished by where the driver process runs; a short sketch after the list below shows how each mode is typically selected.
Cluster Mode: The driver process is launched on the resource cluster alongside the executor processes in this mode.
Client Mode: The driver remains on the client machine. This setup requires the client machine to maintain the driver process throughout the application’s execution.
Local Mode: This mode runs the entire Spark application on a single machine, achieving parallelism through multiple threads. It’s commonly used for learning or testing.
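Here is a small sketch of how each mode is usually selected (the commands in the comments are only illustrative): local mode is chosen through the master URL, while client and cluster mode are normally picked when submitting the application.

```python
from pyspark.sql import SparkSession

# Local mode: the whole application runs in a single JVM on this machine,
# with parallelism coming from local threads (one per CPU core here).
spark = SparkSession.builder.master("local[*]").appName("local-demo").getOrCreate()

# Client vs. cluster mode is typically chosen at submit time, e.g. with spark-submit:
#   spark-submit --master yarn --deploy-mode client  app.py   # driver stays on the client machine
#   spark-submit --master yarn --deploy-mode cluster app.py   # driver runs on the resource cluster
```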
Anatomy
It’s crucial to understand how Spark manages the workload (a short example after the list shows how these pieces appear in practice):
Application: A Spark application consists of a driver program and a set of executors on a cluster. As discussed earlier, each Spark application is associated with its own Spark cluster.
Job: In Spark, a job represents a series of transformations applied to data and is only triggered by an action such as count(), collect(), or show(). A single Spark application can have more than one Spark job.
Stage: A stage is a job segment executed without data shuffling. Spark splits the job into different stages when a transformation requires shuffling data across partitions.
Task: A task is the smallest unit of execution within Spark. Each stage is divided into multiple tasks, which execute processing in parallel across different data partitions.
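Here is a small local-mode sketch of how these pieces show up (the numbers are arbitrary): the action triggers a job, the shuffle introduced by groupBy splits it into two stages, and each stage runs one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# 4 partitions -> 4 parallel tasks in the first stage
df = spark.range(1_000_000, numPartitions=4).selectExpr("id % 10 AS key")

# groupBy requires shuffling rows by key, so Spark cuts the job into a second stage
agg = df.groupBy("key").count()

# Only this action triggers a job; the Jobs and Stages tabs of the Spark UI
# will show one job split into two stages, each made of parallel tasks.
agg.show()
```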
After revisiting some Spark basics, we will continue with Spark Connect. I will delve into the details of this feature, using the knowledge we’ve discussed to ensure everything connects seamlessly.
Spark Connect
Overview
Traditionally, the Spark driver process has to do a significant amount of work, from running the client application to scheduling the actual data processing. In client mode, users must also maintain the full Spark dependencies on the client machine and keep them compatible with the cluster they submit to.
Spark Connect introduces a decoupled client-server architecture for Spark by separating the driver process from the client. A dedicated server hosts a long-running Spark application (Spark cluster) and exposes a gRPC endpoint that accepts client requests. The ultimate goal is to make the client far thinner than in the traditional approach.
High-level flow
At a high level, here is what happens with Spark Connect (a client-side sketch follows the list):
When the Spark Connect server is started, a Spark cluster is created for it. The resources for this Spark cluster can be configured like any other Spark application’s via the command that starts the Connect server. The Spark cluster keeps running as long as the server is alive.
There is a gRPC connection between the client and the Spark Connect Server (the Spark driver). Each client will have its own session when communicating with the server.
Note: The session here is not the SparkSession object.
For each Spark job (e.g., df.show(), df.collect()…), the client converts its DataFrame query into an unresolved logical plan that describes the intent of the operation. Note that the client cannot adjust the CPUs or memory of the remote Spark application; it shares the resources with the other clients.
This plan is encoded using protocol buffers (so it can be language agnostic) and sent to the server.
When the server receives the plan, the driver analyzes and optimizes it, converts it to a physical plan, and schedules the execution on the executors.
The results are sent back to the client as Apache Arrow record batches (also over the gRPC connection).
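To make the flow concrete, here is a minimal client-side sketch (the `sc://localhost:15002` address is a placeholder for your Connect server):

```python
from pyspark.sql import SparkSession

# Opens a gRPC session with the Spark Connect server (no local driver is started)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# This only builds an unresolved logical plan on the client; nothing runs yet
df = spark.range(1_000_000).selectExpr("id % 10 AS key").groupBy("key").count()

# The plan is encoded with protocol buffers, sent over gRPC, analyzed, optimized,
# and executed by the remote driver; results come back as Arrow record batches
df.show()
```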
Thinner Client
The client no longer needs to maintain the full Spark dependencies, as it does not start the driver process. The client library only needs to embed the Spark Connect API, which is built on the DataFrame API and communicates over gRPC: it converts user code into an unresolved logical plan and sends it to the remote driver for execution.
For example, the Python Spark Connect client only requires the Python library; it does not rely on non-Python dependencies such as the Spark jars or a JRE.
In simple terms, Spark Connect to me is like running a Spark application in client mode, except that we don’t need to manage the driver process ourselves; we send our Spark jobs to a shared, remote driver managed by the Connect server.
How it helps
Because the driver lives separately, we can allocate far fewer resources to the client, which now only needs the lightweight Spark Connect API.
The thin Spark Connect API also enables developers to accelerate support for Spark applications in other programming languages.
It’s also more straightforward for developers to debug the Spark job in their IDE, as the process now resembles calling a backend server.
The Spark Connect design is well suited to use cases that require a short warm-up time or an interactive experience. Starting a driver clearly adds to the warm-up time, and traditionally the way to stay interactive was to keep a long-running driver process on our own machine. With Spark Connect, the server manages the driver’s life cycle on our behalf.
The Spark driver can now be upgraded independently of the client applications. This means the server can receive performance improvements and security fixes without requiring all applications to be updated simultaneously, as long as the communication protocols remain compatible between the client and the server.
Limitations
The first limitation is that Spark Connect supports fewer APIs than traditional Spark, as it is built around the DataFrame API; notable omissions are the RDD and SparkContext APIs.
The second limitation I observed is that users can only specify the resources for the remote Spark cluster when the Spark Connect server is started. A client communicating with the server over gRPC cannot adjust the CPUs or memory to its own needs.
All Spark jobs have to share the resources of the server's Spark cluster, so a heavy job will inevitably affect the others. This can be a problem if you need stronger isolation or a more customized resource profile for your Spark application.
Playground
To experiment with Spark Connect, I prepared a Docker Compose setup. You can try it yourself using the Git repo here.
When you run the Docker Compose file, here are the components that will be created:
A standalone Spark resource cluster: the master is started with the script `start-master.sh`, and a set of 3 workers is started with `start-worker.sh`. Each worker has a capacity of 2 CPUs and 1 GB of RAM.
A Spark Connect server, started with the script `start-connect-server.sh` and configured with the Spark cluster settings: the resource cluster master (cluster manager) and the desired executor resources. The Connect server exposes two ports: 4040 for the UI and 15002 for client requests.
After all the services are running, you can visit `localhost:8080` for the Spark resource master UI and `localhost:4040` for the Spark Connect server UI. In the Spark master UI, you can see a running Spark application; this is the one initiated by the Spark Connect server.
Now you can connect to the Connect server (port 15002) via PySpark. This is very similar to working with a traditional Spark application, except that you don’t call the `master` method or set resource configurations (CPUs, memory) when constructing the SparkSession object; instead, you call the `remote` method to set the Spark Connect URI.
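For example, a minimal sketch (the host and port match the Docker Compose setup above) contrasting the two ways of building the session:

```python
from pyspark.sql import SparkSession

# Traditional client mode: you point at a master and size the application yourself, e.g.
#   SparkSession.builder.master("spark://spark-master:7077") \
#       .config("spark.executor.memory", "1g").getOrCreate()

# Spark Connect: no master or resource settings on the client, only the server address
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```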
An important note: although we still use the SparkSession class with Spark Connect, the returned object is not the same SparkSession we get when running a traditional Spark application. You can still use it to create DataFrames and display results, but the underlying machinery is quite different. No driver-side SparkSession is initialized on the client; instead, a session is created for us to communicate with the Spark Connect server, and each client can have its own session.
When we run a command like this:
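(An illustrative command, reusing the `spark` session from the snippet above; any simple DataFrame action behaves the same way.)

```python
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.filter("id > 1").show()
```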
The library transforms the logic into an unresolved logical plan, encodes it with protobuf, and uses the session to send it to the Connect server. The result is returned to the user over the same session.
I believe the creators behind the Python Spark Connect client aim to provide a better development experience, as we only need to modify a small amount of code when working with Spark Connect.
Outro
In this article, we first revisited some Spark fundamentals. Then we delved into Spark Connect, exploring what it is and how it differs from a traditional Spark application. Next, I listed the ways it can help and the things to consider when deciding whether to adopt this feature. Finally, I shared a small experiment you can try with Spark Connect.
For me, Spark Connect won’t replace the way we have developed Spark applications over the last 10 years any time soon; however, it opens the door to more use cases (such as an IoT device sending requests to the server). Spark Connect also makes it faster to support additional programming languages, as developers only need to embed the lightweight Spark Connect API.
Thank you for reading this far. See you next time.