Apache Spark’s architecture

Aniruddh Yadav
3 min read · May 10, 2024


Cluster: A cluster is a collection of machines that work together to process data. In Spark, we can have standalone clusters or those managed by a cluster manager like Mesos, YARN, or Kubernetes.
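In practice, the cluster manager is chosen through the master URL when the application is configured. Here is a minimal PySpark sketch; the host names and ports are placeholders, not values from a real cluster:

from pyspark.sql import SparkSession
# Local mode: the whole application runs in a single JVM on one machine
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
# Standalone cluster: point at the standalone master (placeholder host/port)
# spark = SparkSession.builder.master("spark://master-host:7077").getOrCreate()
# YARN or Kubernetes: the master URL selects the external cluster manager
# spark = SparkSession.builder.master("yarn").getOrCreate()
# spark = SparkSession.builder.master("k8s://https://api-server:6443").getOrCreate()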

Node: A node is a single machine in a cluster. There are two types of nodes in Spark:

Master Node: This is where the driver program runs. It contains your SparkContext.
Worker Node: These are the nodes that run the executor processes which carry out the tasks.

Spark Driver: The Spark driver is the program that declares the transformations and actions on data and submits such requests to the cluster. The driver program runs on the master node.
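As a small sketch (plain PySpark, nothing beyond standard calls): the SparkContext lives inside the driver process, and transformations declared in the driver only build up a plan until an action submits work to the cluster.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# The SparkSession wraps the SparkContext, which lives in the driver process
sc = spark.sparkContext
print(sc.appName, sc.master)
# Declaring a transformation here runs nothing on the cluster yet
rdd = sc.parallelize(range(10)).map(lambda x: x * 2)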

Executor: Executors are JVM processes that run on worker nodes in the cluster and are responsible for executing tasks. Each executor has its own JVM and runs multiple tasks in separate threads. Once a job is submitted, it is divided into stages, and these stages are further divided into tasks, which are then sent to the executors for execution.

Executors reside on worker nodes and have two key roles:

Execute tasks: Executors receive tasks from the driver program and run them. They execute tasks concurrently, which allows Spark to process large amounts of data.
Store and cache data: Executors can store intermediate data for tasks in memory or on disk. This is particularly useful for iterative algorithms, as it allows data to be quickly accessed on subsequent passes.
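A minimal caching sketch (standard PySpark calls; the storage level noted in the comments reflects the usual DataFrame behaviour):

from pyspark import StorageLevel
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)
df.cache()                               # keep partitions in executor memory (spilling to disk if needed)
df.count()                               # first action materializes and caches the data
df.filter(df["id"] % 2 == 0).count()     # later passes read from the executor cache
# persist() lets you choose the storage level explicitly, e.g.:
# df.persist(StorageLevel.MEMORY_ONLY)
df.unpersist()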

Job: A job in Spark is a sequence of transformations on data with a final action. When an action is called on an RDD (Resilient Distributed Dataset), it triggers a job.
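To make the split between transformations and actions concrete, a small sketch:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(100))
doubled = rdd.map(lambda x: x * 2)              # transformation: no job yet
evens = doubled.filter(lambda x: x % 4 == 0)    # transformation: still no job
print(evens.count())                            # action: a job is submitted and executed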

DAG (Directed Acyclic Graph): A DAG is the sequence of computations performed on data. In Spark, a DAG represents a sequence of transformations where each node is an RDD partition and each edge is a transformation applied to the data.

Lineage Graph: The lineage graph is a kind of DAG. It maintains information about the parent RDDs of an RDD and is used to recover data partitions that are lost when a worker node fails.
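One way to inspect the lineage from the driver is the RDD's toDebugString() helper; a minimal sketch:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rdd = (spark.sparkContext.parallelize(range(100))
       .map(lambda x: (x % 10, x))
       .reduceByKey(lambda a, b: a + b))
# toDebugString() prints the lineage: each indented level is a parent RDD
# that Spark can recompute if a partition is lost
print(rdd.toDebugString().decode("utf-8"))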

Here’s a step-by-step breakdown of how the process works:

1. **Starting the Application**: When we submit a Spark application, a driver program starts on a node in the cluster. This node becomes the master node. The driver program contains the main() function of our Spark application.

2. **Registering with the Cluster Manager**: The driver program then registers with the cluster manager. The cluster manager is responsible for allocating resources across applications. It knows which nodes are part of the cluster and how many resources each of them has.

3. **Starting Executors**: Once the driver program has registered with the cluster manager, the cluster manager starts executor processes on the worker nodes in the cluster. The number of executors started depends on the configuration settings of our Spark application and the resources available in the cluster (a configuration sketch follows this list).

4. **Executing Tasks**: The driver program splits each job into stages and tasks and sends the tasks to the executors. The tasks are then executed on the worker nodes.

5. **Communicating Results**: After the tasks are completed, the executors return the results to the driver program. If the driver program has declared an action (like `count()` or `first()`), it will return the results of that action to the user.
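For step 3, the number and size of the executors come from configuration. A hedged sketch using standard Spark properties; the values are arbitrary examples, not recommendations:

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("example-app")
         .config("spark.executor.instances", "4")   # executors to request (YARN/Kubernetes)
         .config("spark.executor.cores", "2")       # task threads per executor
         .config("spark.executor.memory", "4g")     # heap size per executor JVM
         .getOrCreate())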

Here’s a simple example:

from pyspark.sql import SparkSession
# Create a SparkSession (the driver program starts)
spark = SparkSession.builder.getOrCreate()
# Load data into a DataFrame (with inferSchema=True, Spark scans the file, which triggers a job)
df = spark.read.csv("data.csv", header=True, inferSchema=True)
# Transformation (lazy; this only extends the lineage/query plan)
df = df.filter(df["age"] > 30)
# Action (this triggers another job)
df.show()

In this example, creating the SparkSession (`SparkSession.builder.getOrCreate()`) starts the driver program, which registers with the cluster manager so that executors can be launched on the worker nodes. Reading the CSV with schema inference (`spark.read.csv()`) and calling `df.show()` are actions, so the driver turns each of them into a job, splits the job into tasks, and sends the tasks to the executors. The executors run the tasks and return their results to the driver program, which collects them and returns them to the user.

Feel free to reach out to me over LinkedIn in case of any issues or feedback https://www.linkedin.com/in/aniruddhyadav

