Database sharding vs partitioning

滥情空心 2021-01-29 17:46

I have been reading about scalable architectures recently. In that context, two words that keep showing up with regard to databases are sharding and partitioning.

8 Answers
  • 2021-01-29 18:21

    When talking about partitioning, please do not use the terms "replicate" or "replication". Replication is a different concept and is out of scope here. For partitioning, the better word is "divide"; for sharding, the better word is "distribute".

    In partitioning (in the common understanding, though not always), the rows of a large table are divided into two or more disjoint groups, meaning no row belongs to more than one group. Each group is a partition. All the partitions remain under the control of one RDBMS instance, and the division is purely logical. The grouping can be based on a hash, a range, and so on. If you have ten years of data in a table, you can store each year's data in a separate partition by setting partition boundaries on a non-null column such as CREATE_DATE. If a query then specifies a create date between 01-01-1999 and 31-12-2000, only two partitions are hit, and they are scanned sequentially. I did something similar on a database with over a billion records, and the SQL time dropped from 30 seconds to 50 milliseconds, with indexes and the like also in place.

    Sharding means hosting each partition on a different node/machine. Searches inside the partitions/shards can then happen in parallel.
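
    The yearly CREATE_DATE example above can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular RDBMS implements partitioning; the names `partitions`, `insert` and `query` are hypothetical, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory table: one partition (a list of rows) per calendar
# year, mimicking range partitioning on a non-null CREATE_DATE column.
partitions = defaultdict(list)


def insert(row):
    """Route a row to the partition for the year of its create_date."""
    partitions[row["create_date"].year].append(row)


def query(start, end):
    """Return rows with start <= create_date <= end and the partitions scanned."""
    # Only partitions whose year overlaps the requested range are scanned.
    scanned = [y for y in sorted(partitions) if start.year <= y <= end.year]
    rows = [
        r
        for y in scanned
        for r in partitions[y]
        if start <= r["create_date"] <= end
    ]
    return rows, scanned


# Ten years of made-up data, one row per year.
for year in range(1995, 2005):
    insert({"id": year, "create_date": date(year, 6, 15)})

rows, scanned = query(date(1999, 1, 1), date(2000, 12, 31))
print(scanned)                  # the partitions that were touched
print([r["id"] for r in rows])  # the matching rows
```

    With ten yearly partitions, the 1999 to 2000 query touches two of them and skips the other eight.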

  • 2021-01-29 18:23

    I've been diving into this as well, and although I'm far from being the reference on the matter, there are a few key facts that I've gathered and points that I'd like to share:

    A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, or for load balancing.

    https://en.wikipedia.org/wiki/Partition_(database)

    Sharding is a type of partitioning, specifically horizontal partitioning (HP).

    There is also vertical partitioning (VP), whereby you split a table's columns into smaller, distinct parts. Normalization also involves splitting columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized.

    https://en.wikipedia.org/wiki/Shard_(database_architecture)

    I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...

    "Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.

    "Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).

    https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
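
    To make the schema-based view concrete, here is a minimal Python sketch. The `users` table, its columns, and the helper names `horizontal_partition` and `vertical_partition` are all hypothetical, and the modulo rule on an integer shard key is just one simple way to divide rows.

```python
# Hypothetical "users" table as a list of dicts (made-up sample data).
users = [
    {"id": 1, "name": "Ann", "email": "ann@example.com", "bio": "Likes chess"},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "bio": "Likes tea"},
    {"id": 3, "name": "Cho", "email": "cho@example.com", "bio": "Likes maps"},
    {"id": 4, "name": "Dee", "email": "dee@example.com", "bio": "Likes jazz"},
]


def horizontal_partition(rows, shard_key, num_shards):
    """Same schema in every shard; the rows are divided by an integer shard key."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row[shard_key] % num_shards].append(row)
    return shards


def vertical_partition(rows, key, column_groups):
    """The schema is divided; every part keeps the key so rows can be rejoined."""
    return [
        [{c: row[c] for c in (key, *cols)} for row in rows]
        for cols in column_groups
    ]


shards = horizontal_partition(users, "id", 2)
parts = vertical_partition(users, "id", [("name",), ("email", "bio")])

print([r["id"] for r in shards[0]], [r["id"] for r in shards[1]])
print(list(parts[0][0]), list(parts[1][0]))
```

    Each horizontal shard holds full rows for a subset of ids, while each vertical part holds every id but only some of the columns.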

    Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.

    https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm

    When to Partition a Table

    Here are some suggestions for when to partition a table:

    • Tables greater than 2 GB should always be considered as candidates for partitioning.
    • Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
    • When the contents of a table need to be distributed across different types of storage devices.

    Partition Pruning

    Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
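
    The Orders example can be sketched in Python as follows. This is a toy model under stated assumptions, not Oracle's mechanism: the start date, the one-order-per-day data, and the function names are invented for illustration, and the pruned lookup assumes the requested week starts exactly on a partition boundary.

```python
from datetime import date, timedelta

START = date(2019, 1, 7)  # hypothetical first day covered by the table
NUM_WEEKS = 104           # two years of weekly partitions


def week_index(d):
    """Map a date to the index of its weekly partition."""
    return (d - START).days // 7


# orders[i] holds the rows for week i; one made-up order per day.
orders = [[] for _ in range(NUM_WEEKS)]
for offset in range(NUM_WEEKS * 7):
    d = START + timedelta(days=offset)
    orders[week_index(d)].append({"order_date": d, "amount": 10})


def orders_for_week_full_scan(week_start):
    """Without pruning: every partition is read and filtered."""
    rows = [
        r
        for part in orders
        for r in part
        if 0 <= (r["order_date"] - week_start).days < 7
    ]
    return rows, len(orders)


def orders_for_week_pruned(week_start):
    """With pruning: the partition key identifies the one partition to read."""
    return list(orders[week_index(week_start)]), 1


week = START + timedelta(weeks=50)
full_rows, full_touched = orders_for_week_full_scan(week)
pruned_rows, pruned_touched = orders_for_week_pruned(week)
print(full_touched, pruned_touched, full_rows == pruned_rows)
```

    Both functions return the same seven rows, but the pruned version reads one partition where the full scan reads all 104.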

    Partitioning Strategies

    • Range
    • Hash
    • List
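
    As a rough sketch of how the three strategies choose a partition for a row, consider the Python functions below. The function names, boundary values and region lists are hypothetical; real databases declare these rules in DDL rather than in application code.

```python
import bisect
import zlib


def range_partition(value, upper_bounds):
    """Range: index of the first partition whose exclusive upper bound exceeds value.

    An index equal to len(upper_bounds) means the value falls in an overflow partition.
    """
    return bisect.bisect_right(upper_bounds, value)


def hash_partition(key, num_partitions):
    """Hash: spread keys across partitions using a stable hash.

    zlib.crc32 is used because Python's built-in hash() of a str varies between runs.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def list_partition(value, value_lists, default=None):
    """List: each partition is defined by an explicit set of values."""
    for name, values in value_lists.items():
        if value in values:
            return name
    return default


regions = {"europe": {"DE", "FR"}, "americas": {"US", "BR"}}

print(range_partition(2001, [2000, 2010, 2020]))  # year bucketed by range
print(hash_partition("customer-42", 4))           # some index in 0..3
print(list_partition("DE", regions))              # partition named by list membership
```

    Range suits ordered data such as dates, hash suits even distribution when there is no natural ordering, and list suits a small set of discrete values such as region codes.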

    You can read their text and look at their figures, which explain everything pretty well.

    And lastly, it is important to understand that databases are extremely resource intensive:

    • CPU
    • Disk
    • I/O
    • Memory

    Many DBAs will partition on the same machine, where the partitions share all the resources but provide an improvement in disk and I/O by splitting up the data and/or index.

    Other strategies employ a "shared nothing" architecture, where each shard resides on a separate and distinct computing unit (node) and has 100% of that node's CPU, disk, I/O and memory to itself. This brings its own set of advantages and complexities.

    https://en.wikipedia.org/wiki/Shared_nothing_architecture
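
    A shared-nothing layout can be sketched in Python as below. Everything here is an assumption made for illustration: `ShardNode` stands in for an independent database server, `ShardedStore` is a hypothetical router, and the CRC32-modulo rule is just one simple way to map a shard key to a node.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class ShardNode:
    """Stand-in for an independent server that owns its own data."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

    def scan(self, predicate):
        return [v for v in self.rows.values() if predicate(v)]


class ShardedStore:
    """Hypothetical router that sends each key to exactly one node."""

    def __init__(self, node_names):
        self.nodes = [ShardNode(n) for n in node_names]

    def node_for(self, key):
        # Stable hash of the shard key, reduced modulo the number of nodes.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self.node_for(key).put(key, value)

    def get(self, key):
        # A lookup by shard key touches a single node.
        return self.node_for(key).get(key)

    def scan_all(self, predicate):
        # A query without the shard key goes to every node, here in parallel.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            parts = list(pool.map(lambda n: n.scan(predicate), self.nodes))
        return [row for part in parts for row in part]


store = ShardedStore(["node-a", "node-b", "node-c"])
for i in range(100):
    store.put(i, {"id": i, "total": i * 10})

print(store.get(42))
big = store.scan_all(lambda r: r["total"] >= 900)
print(sorted(r["id"] for r in big))
```

    One of the complexities mentioned above shows up even in this toy: with a modulo rule, changing the number of nodes changes where most keys map, so existing data has to be moved.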
