Data Distribution Approaches in Parallel Computing System


This diagram shows the typical algorithms used to distribute data across a cluster for processing or computing. They are commonly used in systems such as Teradata, SQL Server PDW, Azure Synapse, and Spark.

  • Replicated - a full copy of the table is stored on each node. This is useful for small tables, such as reference tables, that are joined with big tables, because the join can then run locally on every node without moving data.
  • Round-robin distributed - rows are assigned to the nodes in the cluster in turn, without regard to their values. This is useful for big tables without obvious candidate join keys, and it ensures data is evenly distributed across the cluster.
  • Hash distributed - rows are distributed by applying a deterministic hashing algorithm to the key values. Rows with the same key value are guaranteed to land on the same node. This is the most commonly used distribution approach for big tables.
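
The three approaches above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any of the systems named here implement distribution: nodes are modelled as in-memory lists, and the function names (`replicate`, `round_robin`, `hash_distribute`) and the sample `orders` table are hypothetical. CRC32 is used as a stand-in deterministic hash because Python's built-in `hash()` is salted per process for strings.

```python
import zlib


def replicate(rows, num_nodes):
    """Replicated: every node receives a full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]


def round_robin(rows, num_nodes):
    """Round-robin: rows are dealt to nodes in turn, ignoring their values."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_distribute(rows, num_nodes, key):
    """Hash: the node is chosen by a deterministic hash of the key value."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        # CRC32 is an assumption for illustration; real systems use their own hash.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[h % num_nodes].append(row)
    return nodes


# Hypothetical sample table: 9 orders spread over 3 customers.
orders = [{"order_id": i, "customer": f"c{i % 3}"} for i in range(9)]

replicated = replicate(orders, 4)
dealt = round_robin(orders, 4)
hashed = hash_distribute(orders, 4, key="customer")

print([len(n) for n in replicated])
print([len(n) for n in dealt])
print([len(n) for n in hashed])
```

Note the trade-off this makes visible: round-robin keeps node sizes within one row of each other, while hash distribution can be skewed when a few key values dominate, but it keeps all rows for a given key together so joins and aggregations on that key need no data movement.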

What other distribution algorithms have you used?
