I'm trying to understand how partitions are created for DynamoDB tables.
According to this blog, "All items with the same partition key are stored together", so if
Point of confusion:
Other answers already explain in detail how DynamoDB creates partitions. So, without going into those details, let me explain the root cause of the confusion about the relationship between partition keys and partitions in DynamoDB.
IMHO, naming the key "Partition Key" is the cause of the confusion. It should just be called Primary Key. On hearing "Partition Key", our minds start relating each partition key to one partition, a one-to-one relationship, which is not the case. As mentioned in the question itself, the key is an input to the "internal hash function". The output of that function is the actual reference to the partition.
Thus, for a table with 1000 user IDs (partition keys), DynamoDB need not have 1000 partitions. It may have 1, 5, 10, or any other number of partitions; that number is decided by the throughput (capacity unit) setting you have specified.
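To make the many-keys-to-few-partitions point concrete, here is a minimal sketch. It is only an illustration: DynamoDB's real hash function and partition count are internal and not exposed, so the MD5 hash, the modulo step, and NUM_PARTITIONS below are all assumptions.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 5  # hypothetical; DynamoDB chooses and hides the real number


def partition_for(partition_key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition index using a stand-in hash (MD5)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and reduce it to a partition index.
    return int.from_bytes(digest[:8], "big") % num_partitions


# 1000 distinct partition key values, as in the example above.
user_ids = [f"user-{i}" for i in range(1000)]
placement = Counter(partition_for(uid) for uid in user_ids)

# Many keys share each partition; the same key always maps to the same one.
print(len(user_ids), "keys spread over", len(placement), "partitions")
```

The only thing to take away is the shape of the mapping: the number of distinct key values and the number of partitions are independent of each other.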
The number of partitions may increase when you increase the throughput setting.
It can also increase as the volume of your data grows, once the existing partitions can no longer hold it.
Hence, what we call the Partition Key in DynamoDB is nothing but the Primary Key that identifies a unique item in the table (with the help of the sort key, in the case of a composite key). It does not relate one-to-one to a partition (which is an SSD-backed storage allocation unit for the table). The actual key to a partition is obtained by passing this partition key to an internal hash function.
More details here.
As per the AWS DynamoDB blog post Choosing the Right DynamoDB Partition Key:
Choosing the right DynamoDB partition key is an important step in the design and building of scalable and reliable applications on top of DynamoDB.
What is a partition key?
DynamoDB supports two types of primary keys:
Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. Here is an example:
Why do I need a partition key?
DynamoDB stores data as groups of attributes, known as items. Items are similar to rows or records in other database systems. DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in Figure 2. For more information, see Understand Partition Behavior in the DynamoDB Developer Guide.
DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item will be stored. Each item’s location is determined by the hash value of its partition key.
All items with the same partition key are stored together, and for composite partition keys, are ordered by the sort key value. DynamoDB will split partitions by sort key if the collection size grows bigger than 10 GB.
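A rough way to picture "stored together and ordered by the sort key" is the in-memory sketch below. It is not how DynamoDB is implemented; the attribute names (customer_id as partition key, order_date as sort key) and the sample items are made up for illustration.

```python
from collections import defaultdict

# Hypothetical items: customer_id is the partition key, order_date the sort key.
items = [
    {"customer_id": "c2", "order_date": "2023-03-01", "total": 40},
    {"customer_id": "c1", "order_date": "2023-02-15", "total": 25},
    {"customer_id": "c1", "order_date": "2023-01-10", "total": 10},
    {"customer_id": "c2", "order_date": "2023-01-05", "total": 70},
]

# Group items that share a partition key value into one collection.
collections_by_key = defaultdict(list)
for item in items:
    collections_by_key[item["customer_id"]].append(item)

# Within each collection, keep items ordered by the sort key.
for collection in collections_by_key.values():
    collection.sort(key=lambda it: it["order_date"])


def query(customer_id, date_from, date_to):
    """Return one customer's items whose sort key lies in [date_from, date_to]."""
    return [it for it in collections_by_key.get(customer_id, [])
            if date_from <= it["order_date"] <= date_to]


print(query("c1", "2023-01-01", "2023-12-31"))
```

Because each collection is already ordered, a range condition on the sort key only has to look at one key's items, which is why queries on a single partition key with a sort key range are cheap.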
Recommendations for partition keys
Use high-cardinality attributes. These are attributes that have distinct values for each item, like e-mail id, employee_no, customerid, sessionid, orderid, and so on.
Use composite attributes. Try to combine more than one attribute to form a unique key, if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key.
Cache the popular items when there is a high volume of read traffic. The cache acts as a low-pass filter, preventing reads of unusually popular items from swamping partitions. For example, consider a table that has deals information for products. Some deals are expected to be more popular than others during major sale events like Black Friday or Cyber Monday.
Add random numbers/digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client.
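The suffix recommendation for write-heavy keys could look roughly like the sketch below. The "#" separator, the function names, and the invoice id format are assumptions for illustration; only the 1-10 range comes from the text.

```python
import random
from typing import List, Optional

SHARD_COUNT = 10  # the predetermined range 1-10 mentioned above


def sharded_key(invoice_id: str, rng: Optional[random.Random] = None) -> str:
    """Build a write key by appending a random suffix in 1..SHARD_COUNT."""
    source = rng if rng is not None else random
    return f"{invoice_id}#{source.randint(1, SHARD_COUNT)}"


def all_shard_keys(invoice_id: str) -> List[str]:
    """List every key a reader has to query to reassemble one invoice."""
    return [f"{invoice_id}#{n}" for n in range(1, SHARD_COUNT + 1)]


# Writes for one hot invoice now spread over up to 10 partition key values.
print(sharded_key("INV-1001"))
print(all_shard_keys("INV-1001"))
```

The trade-off is on the read side: since a writer picks the suffix at random, reading a whole invoice means querying all ten keys and merging the results.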
Read More @ Choosing the Right DynamoDB Partition Key
When an Amazon DynamoDB table is created, you can specify the desired throughput in Reads per second and Writes per second. The table will then be provisioned across multiple servers (partitions) sufficient to provide the requested throughput.
You do not have visibility into the number of partitions created -- it is fully managed by DynamoDB. Additional partitions will be created as the quantity of data increases or when the provisioned throughput is increased.
Let's say you have requested 1000 Reads per second and the data has been internally partitioned across 10 servers (10 partitions). Each partition will provide 100 Reads per second. If all Read requests are for the same partition key, the throughput will be limited to 100 Reads per second. If the requests are spread over a range of different values, the throughput can be the full 1000 Reads per second.
If many queries are made for the same Partition Key, it can result in a Hot Partition that limits the total available throughput.
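The arithmetic in that example can be written out as a small sketch. It assumes the simplified model used above, where throughput is split evenly across partitions; the function name and parameters are hypothetical, and real DynamoDB behavior may differ.

```python
def max_reads_per_second(provisioned_reads, num_partitions, partitions_hit):
    """Upper bound on reads/s when requests touch only `partitions_hit` partitions."""
    per_partition = provisioned_reads / num_partitions
    # Requests cannot use capacity on partitions they never touch.
    return per_partition * min(partitions_hit, num_partitions)


# All requests for one partition key: limited to a single partition's share.
print(max_reads_per_second(1000, 10, 1))   # 100.0

# Requests spread over keys that cover all 10 partitions: full throughput.
print(max_reads_per_second(1000, 10, 10))  # 1000.0
```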
Think of it like a bank with lines in front of teller windows. If everybody lines up at one teller, fewer customers can be served. It is more efficient to distribute customers across many different teller windows. A good partition key for distributing customers might be the customer number, since it is different for each customer. A poor partition key might be their zip code, because they all live in the same area near the bank.
The simple rule is that you should choose a Partition Key that has a range of different values.
See: Partitions and Data Distribution