Data Partitioning in Spark (PySpark) In-depth Walkthrough
by Raymond

Data partitioning is critical to data processing performance, especially when processing large volumes of data in Spark. Partitions in Spark won't span across nodes, though one node can contain more than one partition. When processing, Spark assigns one task for each partition and each worker thread ...
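
As a quick sketch of the one-task-per-partition behaviour described above (the numbers here are arbitrary choices for the example, not values from the article):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-task-demo").getOrCreate()
    sc = spark.sparkContext

    # Build an RDD with an explicit number of partitions; Spark schedules one
    # task per partition when an action runs.
    rdd = sc.parallelize(range(1000), numSlices=8)
    print(rdd.getNumPartitions())  # 8
    rdd.count()  # triggers a stage of 8 tasks, one per partition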

Comments

Gopinath · 2 years ago
#1495 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Thank you Raymond!!

Raymond · 2 years ago
#1494 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Hi Gopinath,

You are mostly correct. 

Here are a few items to consider:

  • There are also replicas in a normal production setup, which means the data will be written to the replica nodes too.
  • The HDFS NameNode decides which node stores each block based on the block placement policy. The HDFS client's location also affects where blocks go: for example, if the client runs on an HDFS data node, the first replica of each block is placed on that same node.
  • By default, each HDFS block corresponds to one partition of the RDD when Spark reads the file from HDFS (see the sketch after this list).
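
Below is a minimal PySpark sketch of that last point. It is a sketch under assumptions, not output from the article: the path hdfs:///data/events.csv is hypothetical and the cluster is assumed to use the default 128 MB block size.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("block-partition-demo").getOrCreate()
    sc = spark.sparkContext

    # textFile uses Hadoop input splits, so by default each HDFS block of the
    # file becomes one RDD partition.
    rdd = sc.textFile("hdfs:///data/events.csv")
    print(rdd.getNumPartitions())  # roughly file size / block size

    # minPartitions can request more (smaller) splits for a splittable format,
    # but the partition count will not drop below the number of blocks.
    rdd_finer = sc.textFile("hdfs:///data/events.csv", minPartitions=8)
    print(rdd_finer.getNumPartitions())  # >= 8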

Gopinath · 2 years ago
#1493 Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough

Great article!!

And to answer your question: let's assume the size of one partition file created by Spark is 200 MB and we write it to HDFS with a block size of 128 MB. One partition file will then be distributed across 2 HDFS data nodes, and if we read this file back from HDFS, the Spark RDD will have 2 partitions (since the file is distributed across 2 HDFS data nodes).

Please correct this answer if it is wrong.

Thank you!
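
A hedged end-to-end sketch of the scenario above; the output path, the row count chosen to give roughly 200 MB of CSV, and the 128 MB block size are all assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-read-demo").getOrCreate()

    # Hypothetical DataFrame assumed to serialize to roughly 200 MB of CSV.
    df = spark.range(0, 25_000_000)

    # coalesce(1) forces Spark to write a single partition file.
    df.coalesce(1).write.mode("overwrite").csv("hdfs:///tmp/one_file_demo")

    # With a 128 MB block size, a ~200 MB file spans 2 HDFS blocks, so reading
    # it back yields an RDD with 2 partitions.
    rdd = spark.sparkContext.textFile("hdfs:///tmp/one_file_demo")
    print(rdd.getNumPartitions())  # 2

    # From a shell: hdfs fsck /tmp/one_file_demo -files -blocks  (lists blocks)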
