Update n random rows in SQL

予麋鹿 2021-02-06 11:00

I have a table with about 1000 rows. I have to update a column ("X") in the table to 'Y' for n random rows. For this I can use the following query:

    update ...
3 Answers
  • 2021-02-06 11:37

    I would use the ROWID:

    UPDATE xyz SET x='Y' WHERE rowid IN (
        SELECT r FROM (
            SELECT ROWID r FROM xyz ORDER BY dbms_random.value
        ) RNDM WHERE rownum < n+1
    )
    

    The actual reason I would use ROWID isn't for efficiency though (it will still do a full table scan) - your SQL may not update the number of rows you want if column m isn't unique.
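    To make the non-unique column problem concrete, here is a contrived illustration (my addition, with hypothetical data, assuming the question's query filtered on column m):

    -- Suppose xyz holds three rows whose m values are 1, 1, 2.
    -- Asking for n = 1 random row by matching on m:
    UPDATE xyz SET x='Y' WHERE m IN (
        SELECT m FROM (
            SELECT m FROM xyz ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 1+1
    );
    -- If the sampled m happens to be 1, this updates two rows, not one.
    -- Matching on ROWID instead always touches exactly n rows.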

    With only 1000 rows, you shouldn't really be worried about efficiency (maybe with a hundred million rows). Without any index on this table, you're stuck doing a full table scan to select random records.

    [EDIT:] "But what if there are 100,000 rows"

    Well, that's still 3 orders of magnitude less than 100 million.

    I ran the following:

    create table xyz as select * from all_objects;
    

    [created about 50,000 rows on my system - non-indexed, just like your table]

    UPDATE xyz SET owner='Y' WHERE rowid IN (
         SELECT r FROM (
              SELECT ROWID r FROM xyz ORDER BY dbms_random.value
         ) RNDM WHERE rownum < 10000
    );
    commit;
    

    This took approximately 1.5 seconds. Maybe it was 1 second, maybe up to 3 seconds (I didn't formally time it; it took about as long as a blink).
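    A quick sanity check on the row count (my addition, assuming no pre-existing ALL_OBJECTS rows had owner = 'Y'):

    SELECT COUNT(*) FROM xyz WHERE owner='Y';
    -- Expect 9999: "rownum < 10000" keeps rownum 1 through 9999.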

  • 2021-02-06 11:50

    You can improve performance by replacing the full table scan with a sample.

    The first problem you run into is that you can't use SAMPLE in a DML subquery; it raises ORA-30560: SAMPLE clause not allowed. But logically this is what is needed:

    UPDATE xyz SET x='Y' WHERE rowid IN (
        SELECT r FROM (
            SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1
    );
    

    You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.

    declare
        type rowid_nt is table of rowid;
        rowids rowid_nt;
    begin
        --Get the rowids
        SELECT r bulk collect into rowids
        FROM (
            SELECT ROWID r
            FROM xyz sample(0.15)
            ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1;
    
        --update the table
        forall i in 1 .. rowids.count
            update xyz set x = 'Y'
            where rowid = rowids(i);
    end;
    /
    

    I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100. The original version took 0.85 seconds, @Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.

    But that's only one scenario; I don't have enough information to say my answer will always be better. As N increases, the sampling advantage is lost and the writes become more significant than the reads. If you have a very small amount of data, the PL/SQL context-switching overhead in my answer may make it slower than @Gerrat's solution.

    For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.

    Here are some problems to consider with my answer:

    1. Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size, the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10: you need to over-sample a little to ensure you get more than N. How much do you need to over-sample? I have no idea; you'll probably have to test it and pick a safe number.
    2. You need to know the approximate number of rows to pick the percent.
    3. The percent must be a literal, so as the number of rows and N change, you'll need to use dynamic SQL to change the percent (see the sketch after this list).
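    All three issues can be handled together with dynamic SQL. Here is a minimal sketch (my addition, not from the original answer); it assumes optimizer statistics are current enough that user_tables.num_rows is usable, and it over-samples by an arbitrary factor of 2:

    declare
        type rowid_nt is table of rowid;
        rowids     rowid_nt;
        n          constant pls_integer := 100;  -- rows to update
        total_rows number;
        pct        number;
    begin
        -- Issue 2: approximate row count (assumes statistics are current).
        select num_rows into total_rows
        from user_tables
        where table_name = 'XYZ';

        -- Issue 1: over-sample by a factor of 2 (chosen arbitrarily);
        -- SAMPLE only accepts percentages below 100.
        pct := least(99.999, 2 * 100 * n / total_rows);

        -- Issue 3: the percent must be a literal, so build the query dynamically.
        execute immediate
            'select rid from (
                 select rowid rid from xyz sample(' || to_char(pct) || ')
                 order by dbms_random.value
             ) where rownum <= :n'
            bulk collect into rowids
            using n;

        forall i in 1 .. rowids.count
            update xyz set x = 'Y'
            where rowid = rowids(i);
    end;
    /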
  • 2021-02-06 12:03

    The following solution works just fine. It performs well and seems to behave similarly to sample():

    create table t1 as 
        select level id, cast ('item'||level as varchar2(32)) item 
        from dual connect by level<=100000; 
    
    Table T1 created.
    
    update t1 set item='*'||item 
    where exists (
        select rnd from (
            select dbms_random.value() rnd
            from t1
        ) t2 where t2.rowid = t1.rowid and rnd < 0.15
    );
    
    14,858 rows updated.
    
    Elapsed: 00:00:00.717 
    

    Note that the alias rnd must be included in the select clause. Otherwise the optimizer changes the filter predicate from RND < 0.15 to DBMS_RANDOM.VALUE() < 0.15, in which case dbms_random.value will be executed only once.
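    To make that concrete, the rewritten statement would behave like this (my reconstruction of the merged predicate, not actual optimizer output):

    update t1 set item='*'||item
    where exists (
        select 1 from t1 t2
        where t2.rowid = t1.rowid and dbms_random.value() < 0.15
    );
    -- dbms_random.value() is executed only once here, so the update is
    -- all-or-nothing: roughly 15% of runs update every row, the rest none.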

    As mentioned in @JonHeller's answer, the best solution is still the PL/SQL block, because it avoids the full table scan. Here is my suggestion:

    create or replace type rowidListType is table of varchar(18);
    /
    create or replace procedure updateRandomly (prefix varchar2 := '*') is
        rowidList rowidListType;
    begin
        -- Sample about 15% of the table and collect the rowids as strings.
        select rowidtochar (rowid) bulk collect into rowidList
        from t1 sample(15);
        -- Update only the sampled rows, matching by rowid.
        update t1 set item=prefix||item
        where exists (
            select 1 from table (rowidList) t2
            where chartorowid(t2.column_value) = t1.rowid
        );
        dbms_output.put_line ('updated '||sql%rowcount||' rows.');
    end;
    /
    begin  updateRandomly; end;
    /
    
    Elapsed: 00:00:00.293
    updated 14892 rows.
    