How can we deal with intersection tables that quickly grow very large?

后端 未结 1 1012
猫巷女王i
猫巷女王i 2021-02-10 09:25

For example, we have table A, and table B which have a many-to-many relationship. An intersection table, Table C stores A.id and B.id along with a value that represents a relati

1条回答
  •  情书的邮戳
    2021-02-10 10:06

    There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.

    There are good reasons for using junction table versus comma-separated list (or similar), including:

    • Efficient querying (through indexing and clustering).
    • Enforcement of referential integrity.

    When designing a junction table, ask the following questions:

    1. Do I need to query in only one direction or both?1
      • If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
      • If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
    2. Is the "extra" data small?
      • If yes, cluster the table and cover the extra data in the secondary index as necessary.2
      • I no, don't cluster the table and don't cover the extra data in the secondary index.3
    3. Are there any additional tables for which the junction table acts as a parent?
      • If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.

    In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):

    CREATE TABLE JUNCTION_TABLE (
        PARENT_ID INT,
        CHILD_ID INT,
        EXTRA_DATA VARCHAR2(50),
        PRIMARY KEY (PARENT_ID, CHILD_ID),
        FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
        FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
    ) ORGANIZATION INDEX COMPRESS;
    
    CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
        JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
    

    Considerations:

    • ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.
    • COMPRESS: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.
    • JUNCTION_TABLE_IE1, EXTRA_DATA: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.

    Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.

    Here is the equivalent MS SQL Server syntax (sans index compression):

    CREATE TABLE JUNCTION_TABLE (
        PARENT_ID INT,
        CHILD_ID INT,
        EXTRA_DATA VARCHAR(50),
        PRIMARY KEY (PARENT_ID, CHILD_ID),
        FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
        FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
    );
    
    CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
        JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
    

    Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.


    1 In other words, do you only need to get "children" of given "parent", or you might also need to get parents of given child.

    2 Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.

    3 This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!

    0 讨论(0)
提交回复
热议问题