Introduction
Welcome to the first exciting week of exploration into Data Engineering! In this week’s edition, let’s delve into the fascinating world of Apache Spark Core APIs and uncover their hidden potential. Get ready to unleash the magic of Spark, where data processing meets lightning speed.
Understanding the data
Consider a business that wants to gain specific insights from its ‘orders’ data. The data sample for ‘orders’ looks like this:
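(For the sketches in the rest of this post, each order record is assumed to be a comma-separated line with the columns order_id, order_date, order_customer_id and order_status, a commonly used retail orders layout.)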
We begin by creating a SparkSession, the grandmaster of Spark operations, unifying our data processing experience under one roof.
Next, load the data into a Resilient Distributed Dataset (RDD). Think of RDDs as the LEGO blocks of Spark – versatile, robust, and ready to build amazing data structures.
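A minimal sketch of both steps, assuming a hypothetical HDFS path for the orders file:

```python
from pyspark.sql import SparkSession

# SparkSession is the single entry point for the application.
spark = SparkSession.builder.appName("OrdersAnalysis").getOrCreate()
sc = spark.sparkContext

# Load the raw text file into an RDD; each element is one CSV line.
# The path below is illustrative only.
orders_rdd = sc.textFile("/data/retail_db/orders")
print(orders_rdd.take(5))
```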
PySpark Use Case 1
Count the orders under each status
a. Transformations are like a magician’s tricks – turning mundane data into spectacular insights.
b. Apply transformations to aggregate values on the basis of the key.
c. Sort the RDD’s values using the sortBy transformation, then collect the final result, as sketched below.
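A sketch of steps (a)–(c), assuming order_status is the fourth comma-separated column:

```python
# Count orders per status: map -> reduceByKey -> sortBy -> collect.
status_counts = (
    orders_rdd
    .map(lambda line: (line.split(",")[3], 1))    # key by order_status
    .reduceByKey(lambda a, b: a + b)              # aggregate counts per status
    .sortBy(lambda kv: kv[1], ascending=False)    # sort by count, descending
)
print(status_counts.collect())
```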
Find the premium customers (Top 10 who placed the greatest number of orders)
a. Apply the map, reduceByKey and sortBy transformations, then perform the take action to collect the top 10 values.
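A sketch, assuming order_customer_id is the third column:

```python
# Top 10 customers by number of orders placed.
top_customers = (
    orders_rdd
    .map(lambda line: (line.split(",")[2], 1))    # key by customer_id
    .reduceByKey(lambda a, b: a + b)              # orders per customer
    .sortBy(lambda kv: kv[1], ascending=False)    # most active customers first
)
print(top_customers.take(10))                     # take is an action
```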
Distinct count of customers who placed at least one order
a. Use the distinct transformation and the count action to address this requirement.
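A sketch under the same column assumptions:

```python
# Number of unique customers who placed at least one order.
distinct_customers = (
    orders_rdd
    .map(lambda line: line.split(",")[2])   # keep customer_id only
    .distinct()                             # drop duplicate customers
    .count()                                # action: returns a number to the driver
)
print(distinct_customers)
```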
Which customer has the maximum number of closed orders?
a. Perform the filter, map, reduceByKey and sortBy transformations as follows:
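A sketch, assuming CLOSED is the status value for closed orders:

```python
# Customer with the maximum number of CLOSED orders.
top_closed = (
    orders_rdd
    .filter(lambda line: line.split(",")[3] == "CLOSED")   # keep closed orders only
    .map(lambda line: (line.split(",")[2], 1))              # key by customer_id
    .reduceByKey(lambda a, b: a + b)                        # closed orders per customer
    .sortBy(lambda kv: kv[1], ascending=False)
)
print(top_closed.take(1))    # the customer with the most closed orders
```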
PySpark Use Case 2
Develop the logic to find the frequency of each word in a 10TB file.
- First, develop the logic around a small data sample.
- Now, test the logic on a small dataset. Use parallelize() to create an RDD from local, small data.
- The same logic can now be applied on a large dataset (10 TB file).
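A sketch of the word-count logic, prototyped on an in-memory sample with parallelize():

```python
# Prototype on a tiny local sample first.
sample = ["spark makes big data simple", "big data needs spark"]
lines_rdd = sc.parallelize(sample)

word_counts = (
    lines_rdd
    .flatMap(lambda line: line.split())     # split each line into words
    .map(lambda word: (word, 1))            # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)
print(word_counts.collect())

# The same transformations can then be applied unchanged to the 10 TB file,
# e.g. starting from sc.textFile("/data/big_corpus")  (hypothetical path).
```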
Checking the number of RDD partitions
As a general rule, the number of RDD partitions equals the number of HDFS blocks (with the default block size of 128 MB).
The file in the following example is smaller than 128 MB, so the number of HDFS blocks, and hence the number of RDD partitions, should be 1.
However, the defaultMinPartitions property is 2, so the getNumPartitions() method reports 2 partitions.
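A sketch of this check, using the same illustrative orders path:

```python
rdd = sc.textFile("/data/retail_db/orders")   # file smaller than 128 MB

print(sc.defaultMinPartitions)    # 2 by default
print(rdd.getNumPartitions())     # 2, even though the file fits in one HDFS block
```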
countByValue
countByValue is an action that produces the same result as a map followed by reduceByKey, returned to the driver.
Because countByValue is an action, its output is a local collection on the driver rather than a distributed RDD, so no further parallel processing can be performed on the result.
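A sketch, counting order statuses with countByValue:

```python
# countByValue returns a plain Python dictionary on the driver, not an RDD.
statuses = orders_rdd.map(lambda line: line.split(",")[3])
status_counts = statuses.countByValue()
print(status_counts)    # e.g. {'COMPLETE': ..., 'CLOSED': ..., 'PENDING': ...}
```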
PySpark Use Case 3
Find the number of orders in each category (Complete, Closed, Cancelled, Processing, etc.)
- In the first solution, we use reduceByKey transformation to get the final aggregate of the number of orders in each category.
- In reduceByKey transformation, local aggregation is done before shuffling the data which reduces the volume of data to be shuffled.
- In the second solution, we use groupByKey transformation to get the final aggregate of the number of orders in each category.
- In the groupByKey transformation, local aggregation is not performed before shuffling, which can sometimes lead to out-of-memory errors because huge volumes of data get shuffled.
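A sketch of both solutions:

```python
pairs = orders_rdd.map(lambda line: (line.split(",")[3], 1))   # (status, 1)

# Solution 1: reduceByKey combines values locally (map side) before the shuffle.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

# Solution 2: groupByKey shuffles every (key, value) pair, then counts per key.
counts_group = pairs.groupByKey().mapValues(len)

print(counts_reduce.collect())
print(counts_group.collect())
```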
Spark JOIN
Consider two datasets, orders (1GB file) and customers (1MB file) in HDFS, that have a common column (customer_id).
- The common column is used to join the two datasets (RDDs). During the join, records from both RDDs that share the same key are shuffled to the same partition.
- Since Join is a wide transformation, a complicated DAG (Directed Acyclic Graph) gets created with two stages.
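A sketch of a plain RDD join, assuming customer_id is the first column of the customers file and the third column of the orders file:

```python
# (customer_id, order_status) pairs from the 1 GB orders file.
orders_kv = sc.textFile("/data/retail_db/orders") \
              .map(lambda line: (line.split(",")[2], line.split(",")[3]))

# (customer_id, customer_name) pairs from the 1 MB customers file.
customers_kv = sc.textFile("/data/retail_db/customers") \
                 .map(lambda line: (line.split(",")[0], line.split(",")[1]))

# join is a wide transformation: matching keys are shuffled to the same partition.
joined = orders_kv.join(customers_kv)
print(joined.take(5))
```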
Broadcast JOIN
- The larger dataset (1 GB) is distributed across the cluster, while a complete copy of the smaller dataset (1 MB) is made available on every machine on the cluster.
- Each node performs the join locally, so shuffling of data is hugely reduced.
- Since there is no shuffling involved in a broadcast join (narrow transformation), a simple DAG with one stage is created.
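A sketch of a map-side (broadcast) join with the RDD API, reusing orders_kv from the previous sketch:

```python
# Collect the small customers dataset to the driver and broadcast it to every executor.
customers_map = dict(
    sc.textFile("/data/retail_db/customers")
      .map(lambda line: (line.split(",")[0], line.split(",")[1]))
      .collect()
)
customers_bc = sc.broadcast(customers_map)

# Each partition joins locally against the broadcast copy; no shuffle is needed.
joined = orders_kv.map(
    lambda kv: (kv[0], kv[1], customers_bc.value.get(kv[0]))
)
print(joined.take(5))
```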
Repartition and Coalesce
Repartition: Use repartition when the number of partitions needs to be increased or decreased; it can do both. It performs a full shuffle of the data so that the resulting partitions are roughly equal in size.
Coalesce: Coalesce can only decrease the number of partitions; it cannot increase them. It merges partitions that already sit on the same node into larger ones, so a full shuffle is avoided.
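A sketch of both operations:

```python
rdd = sc.textFile("/data/retail_db/orders")
print(rdd.getNumPartitions())      # e.g. 2

# repartition can grow or shrink the partition count; it triggers a full shuffle.
bigger = rdd.repartition(8)
print(bigger.getNumPartitions())   # 8

# coalesce can only shrink; it merges co-located partitions and avoids a full shuffle.
smaller = bigger.coalesce(2)
print(smaller.getNumPartitions())  # 2
```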
Cache
- When an action is called for the first time, multiple costly transformations need to be executed.
- Wide transformations result in a complicated DAG.
- If the same action is called again without caching, the costly transformations will get executed again.
- In order to prevent the re-execution of transformations, we can cache the results.
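A sketch of caching the result of a costly transformation chain:

```python
# Cache the aggregated RDD so later actions do not recompute the whole chain.
status_counts = (
    orders_rdd
    .map(lambda line: (line.split(",")[3], 1))
    .reduceByKey(lambda a, b: a + b)   # wide (costly) transformation
    .cache()                           # materialized in memory on the first action
)

print(status_counts.count())     # first action: the transformations run
print(status_counts.collect())   # second action: served from the cache
```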
Conclusion
By performing these operations, we’ve mastered using Apache Spark Core APIs and gained insights into various industry use cases of PySpark. We’ve learnt about partitions, the differences between narrow and wide transformations and the resulting Directed Acyclic Graphs, the internals of processing large datasets using joins, caching and much more.
Our expertise in Data Engineering is expanding, opening doors to innovative solutions and enriching experiences. Our journey is filled with endless possibilities! Let’s keep exploring and embracing the power of data.