Introduction

Apache Spark is an open-source distributed computing system designed to process large datasets in parallel. To support distributed data processing, Spark provides a number of built-in mechanisms, including accumulators. In this blog post, we will explore the concept of accumulators in Apache Spark, their purpose, and how they can be used in Spark applications.

What are Accumulators?

An accumulator is a shared variable that can be used to accumulate data in parallel across multiple tasks in a distributed system.

Accumulators are used to implement counters and sums in Spark applications. Spark provides two types of accumulators: mutable and immutable. Immutable accumulators can only be read by tasks, while mutable accumulators can be both read and written to by tasks.

In a nutshell:

A Spark accumulator is a global, mutable variable that a Spark cluster can safely update on a per-row basis. We can use it to implement counters or sums.

Types of Accumulators in Spark

Spark provides two types of accumulators:

  1. Mutable: can be both read and written to by tasks.
  2. Immutable: can only be read by tasks.

 

  • Mutable accumulators are used when we need to perform a count or sum operation on our data.

For example, if we want to count the number of records in our data, we can use a mutable accumulator to increment the counter for each record processed by our Spark application.
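As a rough sketch (assuming an existing SparkContext named sc; the variable and function names here are only illustrative), such a record counter could look like this:

record_count = sc.accumulator(0)

def count_record(record):
    # Each task adds 1 for every record it processes.
    record_count.add(1)

sc.parallelize(["a", "b", "c", "d"]).foreach(count_record)

# The total is read back on the driver.
print("Records processed:", record_count.value)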

  • Immutable accumulators, on the other hand, are used when we need to track a value that is read-only.

For example, if we want to calculate the average value of a dataset, we can use an immutable accumulator to track the sum of the values and the count of records processed by our Spark application. Once the processing is complete, we can read the accumulated sum and count on the driver and compute the final average.
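A minimal sketch of that sum-and-count approach, again assuming an existing SparkContext sc and illustrative names:

total = sc.accumulator(0.0)
count = sc.accumulator(0)

def track(value):
    # Each task contributes to both the running sum and the record count.
    total.add(value)
    count.add(1)

sc.parallelize([10.0, 20.0, 30.0, 40.0]).foreach(track)

# The driver reads both accumulated values and computes the average.
average = total.value / count.value
print("Average:", average)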

Advantages of an Accumulator

The same results can be obtained with aggregation functions such as count and sum, but those aggregations introduce another shuffle exchange (or at least another pass over the data) into the solution, which hurts the performance of the application. To avoid this kind of performance degradation, it is recommended to use accumulators.
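For instance, here is a sketch of counting malformed rows as a side effect of the main job, rather than running a separate aggregation over the data (assuming an existing SparkContext sc and a made-up parse function; note that because the update happens inside a transformation, the accuracy caveat discussed later in this post applies):

bad_rows = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_rows.add(1)  # counted as a side effect of the main job
        return None

lines = sc.parallelize(["1", "2", "oops", "4"])
parsed = lines.map(parse).filter(lambda x: x is not None)

print("Valid rows:", parsed.count())  # the only job that runs
print("Bad rows:", bad_rows.value)    # no separate aggregation needed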

How to Use Accumulators in Spark Applications

To use accumulators in a Spark application, we need to first create an accumulator object and then use it in our tasks. Here is an example of creating and using a mutable accumulator:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Accumulator Example")
sc = SparkContext(conf=conf)

# Create an accumulator on the driver, initialized to 0.
sum_acc = sc.accumulator(0)

def process_data(data):
    # Each task adds its element to the shared accumulator.
    sum_acc.add(data)

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
rdd.foreach(process_data)

# Only the driver can read the accumulated value.
print("Sum of data: ", sum_acc.value)

In this example, we create a mutable accumulator object called sum_acc and initialize it to 0. We then define a function called process_data that takes a data element and adds it to the accumulator. Next, we create an RDD from the data and call foreach on it, passing process_data as an argument, so each element in the RDD is processed by process_data and added to the accumulator. Finally, we print the value of the accumulator.

The accumulator variable itself is maintained on the driver, so there is no need to collect it from anywhere else; it always lives on the driver. Each task increments the accumulator by communicating its updates back to the driver.
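A small sketch of this behaviour, assuming an existing SparkContext sc: tasks can only add to the accumulator, and in PySpark reading its value inside a task is not allowed; only the driver reads the final result.

counter = sc.accumulator(0)

def bump(x):
    counter.add(x)    # allowed: tasks send their increments to the driver
    # counter.value   # reading the value inside a task raises an error in PySpark

sc.parallelize([1, 2, 3]).foreach(bump)

print(counter.value)  # the driver holds and reads the final value: 6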

Where to use Spark Accumulators

Technically, we can use Spark accumulators in two places:

  1. Within transformations (e.g., inside a withColumn expression or a map function)
  2. Within actions (e.g., foreach)

 

Because accuracy of the results is only guaranteed there, it is always recommended to use accumulators inside actions rather than transformations.

  • Spark tasks fail for a variety of reasons, and Spark retries them on other workers until they succeed. This can execute the same task multiple times. When an accumulator is updated inside a transformation, those repeated runs over the same data increment it more than once and corrupt the count.
  • When the accumulator is incremented inside an action, however, Spark guarantees that each task's update is applied only once, so the result stays accurate, as the sketch below illustrates.
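Here is a hedged sketch of the difference, assuming an existing SparkContext sc; the lineage is simply recomputed twice to reproduce the same over-counting effect that task retries can cause:

rows = sc.parallelize(range(100))

# Update inside a transformation: re-running the lineage (or retrying a failed
# task) re-applies the update, so the counter can drift.
map_counter = sc.accumulator(0)

def tag(x):
    map_counter.add(1)
    return x * 2

doubled = rows.map(tag)
doubled.count()
doubled.count()   # the map is recomputed, so map_counter ends up at 200, not 100

# Update inside an action: Spark applies each task's update exactly once, even
# if the task is retried.
action_counter = sc.accumulator(0)
rows.foreach(lambda x: action_counter.add(1))

print("Counted in transformation:", map_counter.value)  # 200 here
print("Counted in action:", action_counter.value)       # 100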

Conclusion

Accumulators are an important feature of Apache Spark that allows us to perform distributed calculations on large datasets. They provide a simple and efficient way of accumulating data across multiple tasks in a distributed system. By using accumulators in our Spark applications, we can perform complex calculations on large datasets with ease.

In conclusion, accumulators are a powerful tool for data scientists and engineers who work with large datasets and need to perform complex calculations. If you’re new to Spark, be sure to explore the many other features that Spark has to offer, such as ingestion through structured streaming, performance tuning, and advanced concepts. With the help of accumulators, you can make the most of Spark’s distributed computing capabilities and tackle even the largest datasets with ease.

Author: Ishan Bhawantha is an Associate Tech Lead in Data Engineering at CMS. He works on Big Data operations for an NYSE-listed US company.
