Data Partitioning in Spark (PySpark) In-depth Walkthrough

2 years ago
#1495 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Thank you Raymond!!


Raymond · 2 years ago
Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Hi Gopinath,

You are mostly correct. 

Here are a few items to consider:

  • In a normal production setup there are also replicas, which means the data will be written to the replica nodes too.
  • The HDFS NameNode decides which node stores each block based on the block placement policy. The HDFS client's location also influences where blocks are stored: for example, if the client runs on an HDFS DataNode, the first replica of each block will be placed on that same node.
  • By default, each HDFS block corresponds to one partition in the RDD when Spark reads the file.