Data Distribution Approaches in Parallel Computing Systems


This diagram shows the typical algorithms used to distribute data across a cluster for parallel processing or computing. They are commonly used in systems like Teradata, SQL Server PDW, Azure Synapse Analytics, Spark, etc.

  • Replicated - the table is copied in full to every node. This is useful for small tables, such as reference tables that are frequently joined with big tables. 
  • Round-robin distributed - rows are assigned to nodes in turn using a round-robin algorithm. This is useful for big tables without an obvious candidate join key, and it ensures data is evenly distributed across the cluster.  
  • Hash distributed - rows are distributed using a deterministic hash function on the key values, so rows with the same key value are guaranteed to land on the same node. This is the most commonly used distribution approach for big tables. 
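The three approaches above can be sketched in Python. This is a minimal illustration, not any particular system's implementation; the 4-node cluster size, the sample rows, and the `customer_id` key are all hypothetical.

```python
import hashlib
from itertools import cycle

NODES = 4  # hypothetical 4-node cluster


def replicate(rows):
    """Replicated: every node receives a full copy of the table."""
    return {node: list(rows) for node in range(NODES)}


def round_robin(rows):
    """Round-robin: rows are dealt to nodes in turn, so counts stay even."""
    placement = {node: [] for node in range(NODES)}
    for node, row in zip(cycle(range(NODES)), rows):
        placement[node].append(row)
    return placement


def hash_distribute(rows, key):
    """Hash: a deterministic hash of the key value picks the node,
    so equal key values always land on the same node."""
    placement = {node: [] for node in range(NODES)}
    for row in rows:
        # md5 is used here only for a stable, deterministic digest
        digest = hashlib.md5(str(row[key]).encode()).hexdigest()
        placement[int(digest, 16) % NODES].append(row)
    return placement


rows = [{"customer_id": i, "amount": i * 10} for i in range(8)]
rr = round_robin(rows)
hd = hash_distribute(rows, "customer_id")
```

Note the trade-off this makes visible: round-robin gives perfectly even placement but offers no way to locate a row by key, while hash distribution lets a join on `customer_id` run node-locally because matching keys are co-located.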

What other distribution algorithms have you used? 
