Data Partitioning in Spark (PySpark) In-depth Walkthrough

2 years ago
#1493 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Great article!!

And to answer your question: let's assume one partition file created by Spark is 200 MB in size and is written to HDFS with a block size of 128 MB. The file will then be split into 2 HDFS blocks (which may be stored on different data nodes). When Spark reads this file back from HDFS, the resulting RDD will have 2 partitions, since by default each HDFS block becomes one input split.
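The arithmetic above can be sketched as a small helper, assuming the default behavior of one input partition per HDFS block (the function name `num_hdfs_blocks` is just an illustration, not a Spark API):

```python
import math

def num_hdfs_blocks(file_size_mb: float, block_size_mb: float = 128) -> int:
    """Number of HDFS blocks a file occupies, and hence (by default)
    the number of input partitions Spark creates when reading it back."""
    return math.ceil(file_size_mb / block_size_mb)

# A 200 MB partition file written with a 128 MB block size:
print(num_hdfs_blocks(200))  # → 2
```

In an actual PySpark session you could confirm this with `sc.textFile(path).getNumPartitions()`, keeping in mind that Spark may create more partitions than blocks if you pass a larger `minPartitions` argument.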

Please correct this answer if it is wrong.

Thank you!
