Update n random rows in SQL

予麋鹿 2021-02-06 11:00

I have a table which has about 1000 rows. I have to update a column ("X") in the table to 'Y' for n random rows. For this I can use the following query:

    update xyz set x = 'Y' where rowid in (
        select r from (
            select rowid r from xyz order by dbms_random.value
        ) RNDM where rownum < n+1
    );

3 Answers
  •  不知归路
    2021-02-06 11:50

    You can improve performance by replacing the full table scan with a sample.

    The first problem you run into is that you can't use SAMPLE in a DML subquery; it fails with ORA-30560: SAMPLE clause not allowed. But logically this is what is needed:

    UPDATE xyz SET x='Y' WHERE rowid IN (
        SELECT r FROM (
            SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1
    );
    

    You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.

    declare
        type rowid_nt is table of rowid;
        rowids rowid_nt;
    begin
        --Get the rowids
        SELECT r bulk collect into rowids
        FROM (
            SELECT ROWID r
            FROM xyz sample(0.15)
            ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1;
    
        --update the table
        forall i in 1 .. rowids.count
            update xyz set x = 'Y'
            where rowid = rowids(i);
    end;
    /
    

    I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100. The original version took 0.85 seconds, @Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.
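
    For reference, a rough sketch of the kind of test setup described above (the column names and data below are assumptions for illustration, not the exact schema used in the test):

    -- Hypothetical two-column test table with 100,000 rows
    create table xyz (id number, x varchar2(1));

    insert into xyz
    select level, 'N' from dual connect by level <= 100000;
    commit;

    -- In SQL*Plus or SQL Developer, enable timing and run each version
    set timing on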

    But that's only one scenario; I don't have enough information to say my answer will always be better. As N increases, the sampling advantage is lost and the writing becomes more significant than the reading. If you have a very small amount of data, the PL/SQL context-switching overhead in my answer may make it slower than @Gerrat's solution.

    For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.
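
    If you want to check this for your own table, one way (an assumption about your environment, not part of the test above) is to compare the segment size in bytes with the optimizer's row estimate:

    -- Size in bytes of the table's segment
    select bytes
    from   user_segments
    where  segment_name = 'XYZ';

    -- Approximate row count from optimizer statistics
    select num_rows
    from   user_tables
    where  table_name = 'XYZ';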

    Here are some problems to consider with my answer:

    1. Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size, the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10: you need to over-sample a little to ensure that you get more than N. How much do you need to over-sample? I have no idea; you'll probably have to test it and pick a safe number.
    2. You need to know the approximate number of rows to pick the percent.
    3. The percent must be a literal, so as the number of rows and N change you'll need to use dynamic SQL to change the percent (see the sketch after this list).
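
    A minimal sketch of that dynamic-SQL variation is below. It assumes the row count can be taken from optimizer statistics (user_tables.num_rows, so statistics must be reasonably fresh); the 1.5 over-sampling factor and N = 100 are illustrative values you would need to tune:

    declare
        type rowid_nt is table of rowid;
        rowids     rowid_nt;
        v_n        constant pls_integer := 100;  -- number of rows to update
        v_num_rows number;
        v_percent  number;
    begin
        -- Approximate row count from statistics (assumed to exist and be current)
        select num_rows into v_num_rows
        from user_tables
        where table_name = 'XYZ';

        -- Over-sample by 50% so the sample almost certainly contains at least N rows;
        -- cap below 100 because SAMPLE requires a percent strictly less than 100
        v_percent := least(99.99, 100 * v_n / v_num_rows * 1.5);

        -- SAMPLE only accepts a literal, so build the query text at runtime
        execute immediate
            'select r from (
                 select rowid r from xyz sample(' || v_percent || ')
                 order by dbms_random.value
             ) where rownum < :n + 1'
        bulk collect into rowids
        using v_n;

        -- Update the sampled rows by rowid, as in the static version above
        forall i in 1 .. rowids.count
            update xyz set x = 'Y'
            where rowid = rowids(i);
    end;
    /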
