Data Distribution Approaches in Parallel Computing System

2022-01-10


This diagram shows the typical approaches for distributing data across the nodes of a cluster for parallel processing. They are commonly used in systems such as Teradata, SQL Server PDW, Azure Synapse, and Spark.

Replicated - a full copy of the table is stored on every node. This is useful for small tables, such as reference tables, that are joined with big tables, because each node can perform the join locally.
Round-robin distributed - rows are assigned to the nodes in turn, cycling through the cluster, without regard to any column value. This is useful for big tables without an obvious candidate join key, and it ensures data is spread evenly across the cluster.
Hash distributed - rows are assigned by applying a deterministic hash function to the key values. Rows with the same key value are guaranteed to land on the same node. This is the most commonly used distribution approach for big tables.
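
The three approaches above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any of the systems mentioned implement distribution: the function names (replicate, round_robin, hash_distribute), the use of MD5 as the hash function, and the representation of a node as a list of rows are all assumptions made for the example.

```python
import hashlib


def replicate(rows, num_nodes):
    """Replicated: every node receives a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


def round_robin(rows, num_nodes):
    """Round-robin: row i goes to node i % num_nodes, cycling through nodes."""
    nodes = {node: [] for node in range(num_nodes)}
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_distribute(rows, num_nodes, key):
    """Hash distributed: the node is derived from a deterministic hash of the key."""
    nodes = {node: [] for node in range(num_nodes)}
    for row in rows:
        # hashlib is used instead of the built-in hash(), which is salted
        # per process for strings and therefore not stable across runs.
        digest = hashlib.md5(str(row[key]).encode("utf-8")).digest()
        node = int.from_bytes(digest[:8], "big") % num_nodes
        nodes[node].append(row)
    return nodes


# Hypothetical sample data: 20 order rows spread over 5 customers.
orders = [{"customer_id": i % 5, "amount": i} for i in range(20)]

replicated = replicate(orders, 4)
cycled = round_robin(orders, 4)
hashed = hash_distribute(orders, 4, key="customer_id")
```

With round_robin the node sizes differ by at most one row, whereas with hash_distribute the balance depends on how the key values hash: a skewed or low-cardinality key can leave some nodes with far more rows than others.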

What are the other distribution algorithms you have used?
