How to use the subtractByKey() function in PySpark
In PySpark, subtractByKey() is a transformation used to remove elements from one RDD (Resilient Distributed Dataset) based on the keys present in another RDD. It operates on Pair RDDs, where each element is a key-value pair.
Here's a step-by-step explanation of how to use the subtractByKey() function, using a small sample dataset and removing the elements whose keys appear in a second RDD:
Let's assume you have two Pair RDDs: rdd1 and rdd2.
# Import the necessary libraries
from pyspark import SparkContext, SparkConf
# Create a SparkConf and SparkContext
conf = SparkConf().setAppName("subtractByKeyExample").setMaster("local[*]")  # local[*] lets the script run standalone
sc = SparkContext(conf=conf)
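# Note: on modern PySpark versions the usual entry point is SparkSession
# (SparkSession.builder.appName("...").getOrCreate()), from which the
# SparkContext is available as spark.sparkContext; the classic
# SparkConf/SparkContext approach shown here still works for RDD code.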
# Create the sample data for rdd1 and rdd2
data1 = [(1, "apple"), (2, "banana"), (3, "cherry"), (4, "date"), (5, "elderberry")]
data2 = [(2, "banana"), (4, "date")]
# Create the Pair RDDs
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
# Use subtractByKey() to remove elements from rdd1 based on keys present in rdd2
result_rdd = rdd1.subtractByKey(rdd2)
# Collect and print the result
result = result_rdd.collect()
print(result)
# Stop the SparkContext
sc.stop()
In this example, we have two Pair RDDs: rdd1 and rdd2. The subtractByKey() function is applied to rdd1 using rdd2 as an argument. This operation removes elements from rdd1 that have keys present in rdd2.
The output will be (element order may vary from run to run, since collect() on an RDD does not guarantee ordering):
[(1, 'apple'), (3, 'cherry'), (5, 'elderberry')]
As you can see, the elements with keys 2 and 4 were present in both rdd1 and rdd2, so they do not appear in the result.
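One detail worth knowing: only the keys of the second RDD matter; its values are never consulted. Here is a minimal sketch (with made-up placeholder values, run before sc.stop()) showing that an RDD carrying the same keys but completely different values removes exactly the same pairs:
# Only the keys of the second RDD are consulted; its values are ignored
other = sc.parallelize([(2, "anything"), (4, 999)])
print(sorted(rdd1.subtractByKey(other).collect()))
# [(1, 'apple'), (3, 'cherry'), (5, 'elderberry')]
Likewise, if rdd1 held several pairs with the same key, all of them would be removed once that key appears in the other RDD.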
Remember that PySpark's transformations are lazy, meaning they are not executed immediately but rather when an action is performed. In the example above, the collect() function triggers the execution of the transformation, and the result is returned as a list.
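A quick way to see this laziness in action (again using the RDDs from the example above, before sc.stop()): calling subtractByKey() returns immediately without running any Spark job, and the work only happens once an action such as count() or collect() is invoked:
# Building the transformation runs no job yet
pending_rdd = rdd1.subtractByKey(rdd2)
# The action below is what actually triggers the computation
print(pending_rdd.count())  # 3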
Keep in mind that RDDs are immutable, so the subtractByKey() operation does not modify the original RDDs but creates a new RDD with the desired elements removed.
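You can confirm this immutability by collecting rdd1 again after the subtraction; it still holds all five original pairs:
result_rdd = rdd1.subtractByKey(rdd2)
# rdd1 is untouched; subtractByKey() produced a brand-new RDD
print(sorted(rdd1.collect()))
# [(1, 'apple'), (2, 'banana'), (3, 'cherry'), (4, 'date'), (5, 'elderberry')]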