Data Distribution Approaches in Parallel Computing Systems

Raymond · 2022-01-10
This diagram shows the typical algorithms used to distribute data across a cluster for parallel processing or computing. They are commonly used in systems like Teradata, SQL Server PDW, Azure Synapse, Spark, etc.

  • Replicated - the table is replicated in full to every node. This is useful for small tables, such as reference tables that are frequently joined with big tables. 
  • Round-robin distributed - rows are dealt to the nodes in the cluster in turn using a round-robin algorithm, without regard to the data values. This is useful for big tables without an obvious candidate join key, and it guarantees the data is evenly distributed across the cluster. 
  • Hash distributed - rows are distributed using a deterministic hashing algorithm on the key values, so rows with the same key value are guaranteed to land on the same node. This is the most commonly used distribution approach for big tables (a small sketch of all three approaches follows below). 

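To make the three approaches concrete, here is a minimal Python sketch, not tied to any of the products above, that distributes in-memory rows across a hypothetical 4-node cluster. The node count, the `orders` table, and the `customer_id` key are made up for illustration only.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def replicate(rows, num_nodes=NUM_NODES):
    """Replicated: every node receives a full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]

def round_robin(rows, num_nodes=NUM_NODES):
    """Round-robin: rows are dealt to nodes in turn, so the row count
    per node is even regardless of the data values."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes

def hash_distributed(rows, key, num_nodes=NUM_NODES):
    """Hash: a deterministic hash of the key column picks the node,
    so rows with the same key value always land on the same node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        digest = hashlib.md5(str(row[key]).encode("utf-8")).hexdigest()
        nodes[int(digest, 16) % num_nodes].append(row)
    return nodes

# Hypothetical orders table keyed on customer_id.
orders = [{"customer_id": i % 3, "amount": i * 10} for i in range(9)]
for node_id, node in enumerate(hash_distributed(orders, "customer_id")):
    print(node_id, [r["customer_id"] for r in node])
```

The same-node guarantee of hash distribution is what makes it attractive for big tables: when two tables are hash distributed on the same join key, matching rows are already co-located, so the join can run without shuffling data between nodes, whereas round-robin trades that co-location away in exchange for perfectly even row counts.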
What other distribution algorithms have you used? 
