How to update 2 new columns created in a table which has more than 250 million rows


I have to add 2 new columns, col1 char(1) NULL and col2 char(1) NULL, to a table which has more than 250 million rows, and then I have to update those two columns.

1 Answer
  • 2021-01-21 12:34

    You didn't say which version of SQL Server you're using. Starting with SQL Server 2012, adding a new NOT NULL column with a default is, in most cases (Enterprise Edition, with a runtime-constant default), instantaneous: only the table metadata is changed, and no rows are updated. Thanks to Martin Smith for this information. So on that version, you'd be better off dropping and recreating the columns with defaults.

    In prior versions, you could try something like this:

    WHILE 1 = 1 BEGIN
       WITH T AS (
          SELECT TOP (10000) *
          FROM dbo.YourTable AS T
          WHERE
             T.Col1 IS NULL
             AND T.Col2 IS NULL
       )
       UPDATE T
       SET
          T.Col1 = '1',
          T.Col2 = '1'
       ;
       IF @@ROWCOUNT < 10000 BREAK; -- a trick to save one iteration most times
    END;
    

    This could take a long time to run but has the benefit that it will not hold a lock on the table for a long time. The exact combination of indexes and the usual row sizes are also going to affect how well it performs. The sweet spot for the number of rows to update is never constant. It could be 50,000, or 2,000. I have experimented with different counts in the past in chunked operations like this and found that 5,000 or 10,000 are usually pretty close to the optimum size.
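    The batching pattern above is not SQL-specific. As a rough sketch in Python (an in-memory list of dicts standing in for the table, and a small batch_size standing in for TOP (10000); all names here are illustrative, not from the original answer):

```python
# Sketch of the chunked-update loop above, using a list of dicts as a
# stand-in for the table. batch_size plays the role of TOP (10000).

def chunked_update(rows, batch_size=3):
    """Repeatedly update up to batch_size rows where col1/col2 are NULL,
    stopping when a batch comes back short (the @@ROWCOUNT trick)."""
    batches = 0
    while True:
        # "SELECT TOP (batch_size) ... WHERE Col1 IS NULL AND Col2 IS NULL"
        batch = [r for r in rows
                 if r["col1"] is None and r["col2"] is None][:batch_size]
        for r in batch:               # "UPDATE T SET Col1 = '1', Col2 = '1'"
            r["col1"] = "1"
            r["col2"] = "1"
        batches += 1
        if len(batch) < batch_size:   # IF @@ROWCOUNT < 10000 BREAK
            break
    return batches

table = [{"col1": None, "col2": None} for _ in range(10)]
print(chunked_update(table))          # 4 batches: 3 + 3 + 3 + 1
print(all(r["col1"] == "1" for r in table))
```

    Note how the short-batch break saves the final empty iteration: a full last batch still forces one more (empty) pass, but a partial one ends the loop immediately.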

    The above query could also benefit, depending on your version of SQL Server (2008 and up), from a filtered index:

    CREATE UNIQUE NONCLUSTERED INDEX IX_YourTable ON dbo.YourTable (ClusteredColumns)
       WHERE Col1 IS NULL AND Col2 IS NULL;
    

    When you are done, drop the index.

    Note that if you had specified your two new columns with defaults and NOT NULL, they would have been populated during column creation, after which the defaults could be dropped:

    ALTER TABLE dbo.YourTable ADD Col1 char(1)
       NOT NULL CONSTRAINT DF_YourTable_Col1 DEFAULT ('1');
    

    Unlike adding nullable columns to the end, which can be done lickety-split, this could have taken a significant amount of time, so on your 250M-row table it may not have been an option.

    UPDATE: To address Bryan's comments:

    1. The rationale of doing it in small batches of 10,000 is that the negative effects of the update's "overhead" are largely ameliorated. Yes, indeed, it will be a LOT of activity--but it won't block for very long, and that is the #1 performance-harming effect of an activity like this: blocking for a long period.

    2. We know a good deal about the locking potential of this query: the UPDATE takes exclusive locks on the rows it modifies, and the prior point should keep any harmful effects from that to a minimum. Please share if there are additional locking concerns I'm missing.

    3. The filtered index helps because it will allow reading only a few pages of the index, followed by a seek into the giant table. Due to the update, true, the filtered index will have to be maintained to remove the updated rows since they no longer qualify, and this does increase the cost of the write portion of the update. That sounds bad until you realize that the biggest part of the batched UPDATE above, without some kind of index, will be a table scan each time. Given 250M rows, that requires the same resources as 12,500 complete scans of the entire table!!! So my suggestion to use the index DOES work, and is a nice and easy shortcut alternative to walking the clustered index manually.

    4. The "basic law of indexes" (that they are bad for tables with lots of write activity) doesn't hold here. You are thinking of normal OLTP access patterns, where the row being updated can be found with a seek, and every additional index on the table adds write overhead that did not exist before. Compare this to the explanation in my previous point: even if the filtered index made the UPDATE portion take 5 times as much I/O per row (doubtful), that would still be a reduction in I/O of over 2,500 times.
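    The arithmetic behind points 3 and 4 is easy to check: 250M rows in 10,000-row batches is 25,000 iterations, and without an index each iteration scans, on average, about half the table before collecting its batch. A back-of-the-envelope check (the halving and the 5x factor are the rough assumptions stated above, not measurements):

```python
# Back-of-the-envelope check of the scan-count claims above.
rows = 250_000_000
batch = 10_000

iterations = rows // batch            # 25,000 batched UPDATEs
# Without an index, each iteration scans past already-updated rows before
# finding its batch -- on average roughly half the table.
full_scan_equivalents = iterations / 2
print(full_scan_equivalents)          # 12500.0

# Even if maintaining the filtered index made each batch's writes cost
# 5x as much I/O, the read savings would still dominate:
print(full_scan_equivalents / 5)      # 2500.0
```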

    Evaluating the performance impact of the update is important, especially if the table is incredibly busy and constantly in use. If needed, scheduling it during off hours (if such exist) is, just as you suggested, basic sense.

    One potential weak point in my suggestion is that on SQL Server 2008/2008 R2, adding the filtered index could take a long time--though maybe not, since it is a VERY narrow index and will be written in clustered order (probably with a single scan!). So if it does take too long to create, there is an alternative: walk the clustered index manually. That might look like this:

    DECLARE @ClusteredID int = 0; --assume clustered index is a single int column
    DECLARE @Updated TABLE (
       ClusteredID int NOT NULL
    );
    
    WHILE 1 = 1 BEGIN
       WITH T AS (
          SELECT TOP (10000) *
          FROM dbo.YourTable
          WHERE ClusteredID > @ClusteredID -- the "walking" part
          ORDER BY ClusteredID -- also crucial for "walking"
       )
       UPDATE T
       SET
          T.Col1 = '1',
          T.Col2 = '1'
       OUTPUT Inserted.ClusteredID INTO @Updated
       ;
    
       IF @@RowCount = 0 BREAK;
    
       SELECT @ClusteredID = Max(ClusteredID)
       FROM @Updated
       ;
    
       DELETE @Updated;
    END;
    

    There you go: no index, seeks all the way, and only one effective scan of the entire table (with a tiny bit of overhead dealing with the table variable). If the ClusteredID column is densely packed, you can probably even dispense with the table variable and just add 10,000 manually at the end of each loop.
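    The same keyset-walking idea can be sketched in Python, with a sorted list of ids standing in for the clustered index and a bisect lookup standing in for the index seek (all names illustrative):

```python
import bisect

# Sketch of "walking the clustered index": each batch seeks to just past
# the last key seen and takes the next batch_size rows, so the whole
# table is covered in one effective pass with no repeated scanning.

def walk_update(ids, values, batch_size):
    last_key = -1                      # like DECLARE @ClusteredID int = 0
    batches = 0
    while True:
        start = bisect.bisect_right(ids, last_key)   # the index "seek"
        batch = ids[start:start + batch_size]        # TOP (n) ... ORDER BY key
        if not batch:                                # IF @@ROWCOUNT = 0 BREAK
            break
        for k in batch:
            values[k] = ("1", "1")                   # SET Col1 = '1', Col2 = '1'
        last_key = batch[-1]                         # SELECT @ClusteredID = MAX(...)
        batches += 1
    return batches

ids = [2, 5, 7, 11, 13, 17, 19, 23]    # sparse keys, like a real identity column
values = {}
print(walk_update(ids, values, 3))     # 3 batches: 3 + 3 + 2
print(len(values))                     # 8
```

    The sparse keys are the point: because the walk carries forward the highest key actually updated, gaps in the clustered column cost nothing.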

    You mentioned in an update that you have 5 columns in your clustered index. Here's a revised script showing how you might accommodate that:

    DECLARE -- Five random data types seeded with guaranteed low values
       @Clustered1 int = 0,
       @Clustered2 int = 0,
       @Clustered3 varchar(10) = '',
       @Clustered4 datetime = '19000101',
       @Clustered5 int = 0
    ;
    
    DECLARE @Updated TABLE (
       Clustered1 int,
       Clustered2 int,
       Clustered3 varchar(10),
       Clustered4 datetime,
       Clustered5 int
    );
    
    WHILE 1 = 1 BEGIN
       WITH T AS (
          SELECT TOP (10000) *
          FROM dbo.YourTable
          WHERE
             Clustered1 > @Clustered1
             OR (
                Clustered1 = @Clustered1
                AND (
                   Clustered2 > @Clustered2
                   OR (
                      Clustered2 = @Clustered2
                      AND (
                         Clustered3 > @Clustered3
                         OR (
                            Clustered3 = @Clustered3
                            AND (
                               Clustered4 > @Clustered4
                               OR (
                                  Clustered4 = @Clustered4
                                  AND Clustered5 > @Clustered5
                               )
                            )
                         )
                      )
                   )
                )
             )
          ORDER BY
             Clustered1, -- also crucial for "walking"
             Clustered2,
             Clustered3,
             Clustered4,
             Clustered5
       )
       UPDATE T
       SET
          T.Col1 = '1',
          T.Col2 = '1'
       OUTPUT
          Inserted.Clustered1,
          Inserted.Clustered2,
          Inserted.Clustered3,
          Inserted.Clustered4,
          Inserted.Clustered5
       INTO @Updated
       ;
    
       IF @@RowCount < 10000 BREAK;
    
       SELECT TOP (1)
         @Clustered1 = Clustered1,
         @Clustered2 = Clustered2,
         @Clustered3 = Clustered3,
         @Clustered4 = Clustered4,
         @Clustered5 = Clustered5
       FROM @Updated
       ORDER BY -- descending, to grab the highest key updated in this batch
          Clustered1 DESC,
          Clustered2 DESC,
          Clustered3 DESC,
          Clustered4 DESC,
          Clustered5 DESC
       ;
    
       DELETE @Updated;
    END;
    

    If you find that one particular way of doing it doesn't work, try another. Understanding the database system at a deeper level will lead to better ideas and superior solutions. I know the deeply-nested WHERE condition is a doozy. You could try the following on for size as well--this works exactly the same but is much harder to understand so I can't really recommend it, even though adding additional columns is very easy.

    WITH T AS (
       SELECT TOP (10000) *
       FROM
          dbo.YourTable T
       WHERE
          122 <=
             CASE WHEN Clustered1 > @Clustered1 THEN 172 WHEN Clustered1 = @Clustered1 THEN 81 ELSE 0 END
          + CASE WHEN Clustered2 > @Clustered2 THEN 54 WHEN Clustered2 = @Clustered2 THEN 27 ELSE 0 END
             + CASE WHEN Clustered3 > @Clustered3 THEN 18 WHEN Clustered3 = @Clustered3 THEN 9 ELSE 0 END
             + CASE WHEN Clustered4 > @Clustered4 THEN 6 WHEN Clustered4 = @Clustered4 THEN 3 ELSE 0 END
             + CASE WHEN Clustered5 > @Clustered5 THEN 2 WHEN Clustered5 = @Clustered5 THEN 1 ELSE 0 END
       ORDER BY
          Clustered1, -- also crucial for "walking"
          Clustered2,
          Clustered3,
          Clustered4,
          Clustered5
    )
    UPDATE T
    SET
       T.Col1 = '1',
       T.Col2 = '1'
    OUTPUT
       Inserted.Clustered1,
       Inserted.Clustered2,
       Inserted.Clustered3,
       Inserted.Clustered4,
       Inserted.Clustered5
    INTO @Updated
    ;
    

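    If the weighted-CASE version looks like magic, it is just encoding the lexicographic "greater than" comparison as a sum: each level contributes a "greater" weight or an "equal" weight (half of it), chosen so the total clears 122 exactly when the five-column key is strictly greater. A quick brute-force check of that equivalence (Python; small domains are enough to exercise every branch):

```python
from itertools import product

# (greater, equal) weights per key level, as in the T-SQL CASE expressions.
WEIGHTS = [(172, 81), (54, 27), (18, 9), (6, 3), (2, 1)]

def case_trick(row, key):
    """The weighted-CASE predicate: True iff the sum reaches 122."""
    total = 0
    for (gt_w, eq_w), r, k in zip(WEIGHTS, row, key):
        total += gt_w if r > k else eq_w if r == k else 0
    return total >= 122

# Exhaustively compare against plain tuple (lexicographic) comparison.
domain = [0, 1, 2]
for row in product(domain, repeat=5):
    for key in product(domain, repeat=5):
        assert case_trick(row, key) == (row > key)
print("equivalent")
```

    The weights only need two properties: a "greater" at any level must push the total past 122 no matter what the lower levels contribute, and all five "equal" weights together must fall just short (81 + 27 + 9 + 3 + 1 = 121 < 122), so an exactly-equal key is excluded.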
    I have many times performed updates on gigantic tables with this exact "walk-the-clustered-index in small batches" strategy with no ill effect on the production database.
