Update n random rows in SQL

予麋鹿 · 2021-02-06 11:00

I have a table with about 1000 rows. I need to update a column ("X") in the table to 'Y' for n random rows. For this I can use the following query:

update xyz set x = 'Y'
where m in (
    select m from (
        select m from xyz order by dbms_random.value
    ) where rownum <= n
);

3 Answers
  • Gerrat · 2021-02-06 11:37

    I would use the ROWID:

    UPDATE xyz SET x='Y' WHERE rowid IN (
        SELECT r FROM (
            SELECT ROWID r FROM xyz ORDER BY dbms_random.value
        ) RNDM WHERE rownum < n+1
    )
    

    The actual reason I would use ROWID isn't efficiency, though (it will still do a full table scan): your SQL may not update the number of rows you want if column m isn't unique.
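
    To see why, here is a minimal sketch (my own illustration, using a hypothetical xyz_demo table where m is not unique): selecting n random values of m can match more than n rows.

        create table xyz_demo (m number, x varchar2(1));
        insert into xyz_demo values (1, 'A');
        insert into xyz_demo values (1, 'B');  -- m = 1 appears twice
        insert into xyz_demo values (2, 'C');

        -- "update 1 random row" via m: if the subquery happens to return m = 1,
        -- this statement updates 2 rows, not 1
        update xyz_demo set x = 'Y'
        where m in (
            select m from (
                select m from xyz_demo order by dbms_random.value
            ) where rownum <= 1
        );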

    With only 1000 rows, you shouldn't really be worried about efficiency (maybe with a hundred million rows). Without any index on this table, you're stuck doing a full table scan to select random records.

    [EDIT:] "But what if there are 100,000 rows"

    Well, that's still 3 orders of magnitude less than 100 million.

    I ran the following:

    create table xyz as select * from all_objects;
    

    [created about 50,000 rows on my system - non-indexed, just like your table]

    UPDATE xyz SET owner='Y' WHERE rowid IN (
         SELECT r FROM (
              SELECT ROWID r FROM xyz ORDER BY dbms_random.value
         ) RNDM WHERE rownum < 10000
    );
    commit;
    

    This took approximately 1.5 seconds. Maybe it was 1 second, maybe up to 3 seconds (I didn't formally time it; it took about as long as a blink).
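
    If you want a repeatable number instead of eyeballing it, here is a small sketch (my addition) using the standard dbms_utility.get_time function, which returns a tick count in hundredths of a second:

        -- set serveroutput on to see the result
        declare
            t0 pls_integer;
        begin
            t0 := dbms_utility.get_time;
            update xyz set owner='Y' where rowid in (
                select r from (
                    select rowid r from xyz order by dbms_random.value
                ) rndm where rownum < 10000
            );
            dbms_output.put_line('elapsed: ' || (dbms_utility.get_time - t0) / 100 || 's');
            rollback;  -- discard the test update so the run is repeatable
        end;
        /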

  • Jon Heller · 2021-02-06 11:50

    You can improve performance by replacing the full table scan with a sample.

    The first problem you run into is that you can't use SAMPLE in a DML subquery (ORA-30560: SAMPLE clause not allowed), but logically this is what is needed:

    UPDATE xyz SET x='Y' WHERE rowid IN (
        SELECT r FROM (
            SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1
    );
    

    You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.

    declare
        type rowid_nt is table of rowid;
        rowids rowid_nt;
    begin
        --Get the rowids
        SELECT r bulk collect into rowids
        FROM (
            SELECT ROWID r
            FROM xyz sample(0.15)
            ORDER BY dbms_random.value
        ) RNDM WHERE rownum < 100/*n*/+1;
    
        --update the table
        forall i in 1 .. rowids.count
            update xyz set x = 'Y'
            where rowid = rowids(i);
    end;
    /
    

    I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100. The original version took 0.85 seconds, @Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.

    But that's only one scenario, I don't have enough information to say my answer will always be better. As N increases the sampling advantage is lost, and the writing will be more significant than the reading. If you have a very small amount of data, the PL/SQL context switching overhead in my answer may make it slower than @Gerrat's solution.

    For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.
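
    To check the size in bytes rather than rows, a quick sketch (my addition) against the standard user_segments view:

        select segment_name, round(bytes / 1024 / 1024, 1) as mb
        from user_segments
        where segment_name = 'XYZ';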

    Here are some problems to consider with my answer:

    1. Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10. You need to over-sample a little bit to ensure that you get more than N. How much do you need to over-sample? I have no idea, you'll probably have to test it and pick a safe number.
    2. You need to know the approximate number of rows to pick the percent.
    3. The percent must be a literal, so as the number of rows and N change, you'll need to use dynamic SQL to change the percent (see the sketch after this list).
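
    Here is a sketch of that dynamic-SQL variant (my addition; the 1.5 over-sampling factor and the approx_rows value are assumptions you would tune for your own data):

        declare
            type rowid_nt is table of rowid;
            rowids rowid_nt;
            n constant number := 100;               -- rows to update
            approx_rows constant number := 100000;  -- rough row count of xyz (assumption)
            -- over-sample by 50% so the sample rarely returns fewer than n rows;
            -- SAMPLE only accepts percentages below 100
            pct constant number := least(99, 100 * n / approx_rows * 1.5);
        begin
            execute immediate
                'select r from (
                     select rowid r from xyz sample(' || to_char(pct) || ')
                     order by dbms_random.value
                 ) where rownum <= :n'
                bulk collect into rowids
                using n;

            forall i in 1 .. rowids.count
                update xyz set x = 'Y'
                where rowid = rowids(i);
        end;
        /
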
  • 2021-02-06 12:03

    The following solution works just fine. It performs well and behaves similarly to sample():

    create table t1 as 
        select level id, cast ('item'||level as varchar2(32)) item 
        from dual connect by level<=100000; 
    
    Table T1 created.
    
    update t1 set item='*'||item 
    where exists (
        select rnd from (
            select dbms_random.value() rnd
            from t1
        ) t2 where t2.rowid = t1.rowid and rnd < 0.15
    );
    
    14,858 rows updated.
    
    Elapsed: 00:00:00.717 
    

    Note that the alias rnd must be included in the select clause. Otherwise the optimizer changes the filter predicate from RND < 0.15 to DBMS_RANDOM.VALUE() < 0.15, in which case dbms_random.value will be executed only once.
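
    One way to verify which predicate the optimizer actually applied is to inspect the execution plan (a sketch using the standard dbms_xplan package; the predicate section is part of its default output):

        explain plan for
        update t1 set item='*'||item
        where exists (
            select rnd from (
                select dbms_random.value() rnd
                from t1
            ) t2 where t2.rowid = t1.rowid and rnd < 0.15
        );

        select * from table(dbms_xplan.display);
        -- look in "Predicate Information" for filter("RND"<0.15)
        -- versus filter("DBMS_RANDOM"."VALUE"()<0.15)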

    As mentioned in @JonHeller's answer, the best solution remains the PL/SQL code block because it avoids a full table scan. Here is my suggestion:

    create or replace type rowidListType is table of varchar(18);  
    /
    create or replace procedure updateRandomly (prefix varchar2 := '*') is
        rowidList rowidListType;  
    begin  
        select rowidtochar (rowid) bulk collect into rowidList
        from t1 sample(15)
        ;
        update t1 set item=prefix||item 
        where exists (
            select 1 from table (rowidList) t2
            where chartorowid(t2.column_value) = t1.rowid
        );
        dbms_output.put_line ('updated '||sql%rowcount||' rows.'); 
    end;
    /
    begin  updateRandomly; end;
    / 
    
    Elapsed: 00:00:00.293
    updated 14892 rows.
    