
2 years ago
#1494 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Hi Gopinath,

You are mostly correct. 

Here are a few items to consider:

  • In a normal production setup, blocks are also replicated, which means the data is written to the replica nodes too.
  • The HDFS NameNode decides which node stores each block based on the block placement policy. The location of the HDFS client also affects where the blocks are stored. For example, if the client runs on an HDFS data node, the first replica of each block is placed on that same node.
  • By default, each block corresponds to one partition in the RDD when reading from HDFS in Spark.
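
As a rough illustration of the last point, the block arithmetic for the 200MB file from the question can be sketched in plain Python. The helper names `estimate_hdfs_blocks` and `block_sizes` are made up for this example and are not part of Spark or HDFS; the sketch assumes a 128MB block size and ignores replication and any split-size settings.

```python
MB = 1024 * 1024

def estimate_hdfs_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Hypothetical helper: number of HDFS blocks a single file occupies."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still counts.
    return -(-file_size_bytes // block_size_bytes)

def block_sizes(file_size_bytes, block_size_bytes=128 * MB):
    """Hypothetical helper: size of each block, the last one possibly smaller."""
    full, rest = divmod(file_size_bytes, block_size_bytes)
    return [block_size_bytes] * full + ([rest] if rest else [])

# One 200MB partition file written with a 128MB block size.
print(estimate_hdfs_blocks(200 * MB))            # 2
print([s // MB for s in block_sizes(200 * MB)])  # [128, 72]
```

With the default settings, this block count is what you would expect to see as the partition count when the file is read back. In a real session, `rdd.getNumPartitions()` reports the actual number, which can differ if a minimum partition count or a different split size is configured.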

Gopinath, 2 years ago
Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Great article!!

And to answer your question, let's assume one partition file created by Spark is 200MB and it is written to HDFS with a block size of 128MB. Then that partition file will be distributed across 2 HDFS data nodes, and if we read the file back from HDFS, the Spark RDD will have 2 partitions (since the file is distributed across 2 HDFS data nodes).

Please correct this answer if it is wrong.

Thank you!
