Raymond · 2 years ago
Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough
You are mostly correct.
Here are a few items to consider:
- In a normal production setup there are also replicas, which means the data will be written to the replica nodes too.
- The HDFS NameNode decides which nodes store each block based on the block placement policy. The HDFS client's location also affects where blocks end up: for example, if the client runs on an HDFS DataNode, the first replica of each block is placed on that same node.
- By default, each HDFS block corresponds to one partition of the RDD when Spark reads from HDFS.
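The block-to-partition mapping in the last point can be sketched with a small back-of-the-envelope calculation. This is a hypothetical helper, not a Spark API: Spark's actual split logic also honors `minPartitions` and input-format details, but for a plain file read with `sc.textFile`, one HDFS block usually becomes one partition.

```python
from math import ceil

def expected_hdfs_partitions(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Estimate the RDD partition count for a file read from HDFS.

    Assumes the default one-partition-per-block behaviour and the
    default 128 MiB HDFS block size; real partition counts can differ
    when minPartitions or a custom InputFormat is used.
    """
    return max(1, ceil(file_size_bytes / block_size_bytes))

# A 1 GiB file stored with 128 MiB blocks spans 8 blocks, so
# sc.textFile("hdfs://...") would normally give 8 partitions.
print(expected_hdfs_partitions(1024 * 1024 * 1024))  # 8
```

You can confirm the actual count in PySpark with `rdd.getNumPartitions()` after reading the file.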
Thank you Raymond!!