This diagram shows the typical algorithms to distribute data into a cluster for processing or computing. They are commonly used in systems like Teradata, SQL Server PWD, Azure Synapse, Spark, etc.
- Replicated - table are replicated to each node. This is useful to distribute small tables like reference tables to join with big tables.
- Round-robin distributed - data is randomly distributed to the nodes in the cluster using round-robin algorithm. This is useful for big tables without obvious candidate join keys. This will ensure data is evenly distributed across the cluster.
- Hash distributed - data is distributed using deterministic hashing algorithm on the key values. Same value will guarantee to be distributed to the same node. This is the most commonly used distribution approach for big tables.
What are the other distribution algorithms you have used?