How to use the subtractByKey() function in PySpark
In PySpark, subtractByKey() is a transformation used to remove elements from one RDD (Resilient Distributed Dataset) based on the keys present in another RDD. It operates on Pair RDDs, where each element is a key-value pair.
Here's a step-by-step explanation of how to use the subtractByKey() function, using a small sample dataset and removing the elements whose keys appear in a second RDD:
Let's assume you have two Pair RDDs: rdd1 and rdd2.
# Import the necessary libraries
from pyspark import SparkContext, SparkConf
# Create a SparkConf and SparkContext
conf = SparkConf().setAppName("subtractByKeyExample").setMaster("local[*]")  # local[*] lets the script run standalone
sc = SparkContext(conf=conf)
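# Note: on modern PySpark versions the usual entry point is SparkSession
# (SparkSession.builder.appName("...").getOrCreate()), from which the
# SparkContext is available as spark.sparkContext; the classic
# SparkConf/SparkContext approach shown here still works for RDD code.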
# Create the sample data for rdd1 and rdd2
data1 = [(1, "apple"), (2, "banana"), (3, "cherry"), (4, "date"), (5, "elderberry")]
data2 = [(2, "banana"), (4, "date")]
# Create the Pair RDDs
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
# Use subtractByKey() to remove elements from rdd1 based on keys present in rdd2
result_rdd = rdd1.subtractByKey(rdd2)
# Collect and print the result
result = result_rdd.collect()
print(result)
# Stop the SparkContext
sc.stop()
In this example, we have two Pair RDDs: rdd1 and rdd2. The subtractByKey() function is applied to rdd1 using rdd2 as an argument. This operation removes elements from rdd1 that have keys present in rdd2.
The output will be (element order may vary from run to run, since collect() on an RDD does not guarantee ordering):
[(1, 'apple'), (3, 'cherry'), (5, 'elderberry')]
As you can see, the elements with keys 2 and 4 were present in both rdd1 and rdd2, so they do not appear in the result.
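One detail worth knowing: only the keys of the second RDD matter; its values are never consulted. Here is a minimal sketch (with made-up placeholder values, run before sc.stop()) showing that an RDD carrying the same keys but completely different values removes exactly the same pairs:
# Only the keys of the second RDD are consulted; its values are ignored
other = sc.parallelize([(2, "anything"), (4, 999)])
print(sorted(rdd1.subtractByKey(other).collect()))
# [(1, 'apple'), (3, 'cherry'), (5, 'elderberry')]
Likewise, if rdd1 held several pairs with the same key, all of them would be removed once that key appears in the other RDD.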
Remember that PySpark's transformations are lazy, meaning they are not executed immediately but rather when an action is performed. In the example above, the collect() function triggers the execution of the transformation, and the result is returned as a list.
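A quick way to see this laziness in action (again using the RDDs from the example above, before sc.stop()): calling subtractByKey() returns immediately without running any Spark job, and the work only happens once an action such as count() or collect() is invoked:
# Building the transformation runs no job yet
pending_rdd = rdd1.subtractByKey(rdd2)
# The action below is what actually triggers the computation
print(pending_rdd.count())  # 3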
Keep in mind that RDDs are immutable, so the subtractByKey() operation does not modify the original RDDs but creates a new RDD with the desired elements removed.
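You can confirm this immutability by collecting rdd1 again after the subtraction; it still holds all five original pairs:
result_rdd = rdd1.subtractByKey(rdd2)
# rdd1 is untouched; subtractByKey() produced a brand-new RDD
print(sorted(rdd1.collect()))
# [(1, 'apple'), (2, 'banana'), (3, 'cherry'), (4, 'date'), (5, 'elderberry')]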