Populate random data from another table

广开言路 2021-01-18 23:44
update dataset1.test
   set column4 = (select column1 
                 from dataset2
                 order by random()
                 limit 1
                 )          


        
2 answers
  •  小蘑菇 (original poster)
    2021-01-19 00:07

    SETUP

    Let's start by assuming your tables and data are the following. Note that I assume dataset1 has a primary key (it can be a composite one but, for the sake of simplicity, let's make it an integer):

    CREATE TABLE dataset1
    (
         id INTEGER PRIMARY KEY,
         column4 TEXT
    ) ;
    
    CREATE TABLE dataset2
    (
        column1 TEXT
    ) ;
    

    We fill both tables with sample data:

    INSERT INTO dataset1
        (id, column4)
    SELECT
        i, 'column 4 for id ' || i
    FROM
        generate_series(101, 120) AS s(i);
    
    INSERT INTO dataset2
        (column1)
    SELECT
        'SOMETHING ' || i
    FROM 
        generate_series (1001, 1020) AS s(i) ;
    

    Sanity check:

    SELECT count(DISTINCT column4) FROM dataset1 ;
    
    | count |
    | ----: |
    |    20 |
    

    Case 1: number of rows in dataset1 <= rows in dataset2

    We'll perform a complete shuffle. Each value from dataset2 will be used at most once.
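
    To see which of the two cases applies, the row counts can be compared first. This is only a sketch, assuming the dataset1 and dataset2 tables from the setup above; the output column names are made up for illustration:

    ```sql
    -- Compare row counts to decide between Case 1 and Case 2.
    -- rows_dataset1, rows_dataset2 and case_1_applies are illustrative names.
    SELECT
        (SELECT count(*) FROM dataset1) AS rows_dataset1,
        (SELECT count(*) FROM dataset2) AS rows_dataset2,
        (SELECT count(*) FROM dataset1) <= (SELECT count(*) FROM dataset2) AS case_1_applies ;
    ```

    With the sample data, both tables hold 20 rows, so Case 1 applies.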

    EXPLANATION

    To make an update that fills column4 with the values of dataset2.column1 in random order, we need some intermediate steps.

    First, for the dataset1, we need to create a list (relation) of tuples (id, rn), that are just:

    (id_1,   1),
    (id_2,   2),
    (id_3,   3),
    ...
    (id_20, 20)
    

    Here id_1, ..., id_20 are the ids present in dataset1. They can be of any type, they need not be consecutive, and they can be composite.

    For dataset2, we need to create another list of (column1, rn) tuples, which looks like:

    (column1_1,  17),
    (column1_2,   3),
    (column1_3,  11),
    ...
    (column1_20, 15)
    

    In this case, the second column contains all the values 1 .. 20, but shuffled.

    Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
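
    Before touching any data, the pairing can be previewed with a plain SELECT. This is a minimal sketch, assuming the dataset1 and dataset2 tables from the setup; it reuses the names original_keys and shuffled_data that appear in the queries of this answer:

    ```sql
    -- Read-only preview of the random pairing; nothing is updated.
    WITH original_keys AS
    (
        -- (id, rn) with rn = 1 .. number of rows in dataset1
        SELECT id, row_number() OVER () AS rn
        FROM dataset1
    )
    , shuffled_data AS
    (
        -- (column1, rn) with rn assigned in random order
        SELECT column1, row_number() OVER (ORDER BY random()) AS rn
        FROM dataset2
    )
    SELECT
        original_keys.id,
        shuffled_data.column1
    FROM
        original_keys
        JOIN shuffled_data ON shuffled_data.rn = original_keys.rn
    ORDER BY
        original_keys.id ;
    ```

    Every execution should return a different pairing, since random() is re-evaluated each time.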

    THE REAL QUERY

    This can all be done (clearly, I hope) by using CTEs (WITH clauses) to hold the intermediate relations:

    WITH original_keys AS
    (
        -- This creates tuples (id, rn),
        -- where rn increases from 1 to the number of rows
        SELECT 
            id, 
            row_number() OVER  () AS rn
        FROM 
            dataset1
    )
    , shuffled_data AS
    (
        -- This creates tuples (column1, rn)
        -- where rn moves between 1 and number of rows, but is randomly shuffled
        SELECT 
            column1,
            -- The next statement is what *shuffles* all the data
            row_number() OVER  (ORDER BY random()) AS rn
        FROM 
            dataset2
    )
    -- You update your dataset1
    -- with the shuffled data, linking back to the original keys
    UPDATE
        dataset1
    SET
        column4 = shuffled_data.column1
    FROM
        shuffled_data
        JOIN original_keys ON original_keys.rn = shuffled_data.rn
    WHERE
        dataset1.id = original_keys.id ;
    

    Note that the trick is performed by means of:

    row_number() OVER (ORDER BY random()) AS rn
    

    The row_number() window function produces as many consecutive numbers as there are rows, starting from 1. These numbers end up attached to the rows in random order because the OVER clause sorts all the rows by random() before numbering them.
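
    The effect is easy to observe in isolation. Here is a small sketch that needs no tables; the alias s(v) and the range 1 to 5 are arbitrary choices for illustration:

    ```sql
    -- Attach a randomly assigned position rn to each of the values 1..5.
    -- The rn column is a permutation of 1..5 that changes on every run.
    SELECT
        v,
        row_number() OVER (ORDER BY random()) AS rn
    FROM
        generate_series(1, 5) AS s(v)
    ORDER BY
        v ;
    ```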

    CHECKS

    We can check again:

    SELECT count(DISTINCT column4) FROM dataset1 ;
    
    | count |
    | ----: |
    |    20 |
    
    SELECT * FROM dataset1 ;
    
     id | column4       
    --: | :-------------
    101 | SOMETHING 1016
    102 | SOMETHING 1009
    103 | SOMETHING 1003
    ...
    118 | SOMETHING 1012
    119 | SOMETHING 1017
    120 | SOMETHING 1011
    

    ALTERNATIVE

    Note that this can also be done with subqueries instead of CTEs, by simple substitution. That might improve performance in some cases:

    UPDATE
        dataset1
    SET
        column4 = shuffled_data.column1
    FROM
        (SELECT 
            column1,
            row_number() OVER  (ORDER BY random()) AS rn
        FROM 
            dataset2
        ) AS shuffled_data
        JOIN 
        (SELECT 
            id, 
            row_number() OVER  () AS rn
        FROM 
            dataset1
        ) AS original_keys ON original_keys.rn = shuffled_data.rn
    WHERE
        dataset1.id = original_keys.id ;
    

    And again...

    SELECT * FROM dataset1;
    
     id | column4       
    --: | :-------------
    101 | SOMETHING 1011
    102 | SOMETHING 1018
    103 | SOMETHING 1007
    ...
    118 | SOMETHING 1020
    119 | SOMETHING 1002
    120 | SOMETHING 1016
    

    You can check the whole setup and experiment at dbfiddle here

    NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.


    Case 2: number of rows in dataset1 > rows in dataset2

    In this case, values for column4 can be repeated several times.

    The simplest approach I can think of (probably not the most efficient one, but easy to understand) is to create a function, random_column1, marked as VOLATILE:

    CREATE FUNCTION random_column1() 
        RETURNS TEXT
        VOLATILE      -- important!
        LANGUAGE SQL
    AS
    $$
        SELECT
            column1
        FROM
            dataset2
        ORDER BY
            random()
        LIMIT
            1 ;
    $$ ;
    

    And use it to update:

    UPDATE
        dataset1
    SET
        column4 = random_column1();
    

    This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
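
    After running the update, the distribution can be inspected. These are sketch queries, assuming the same dataset1 and dataset2 tables; times_used is an illustrative column name:

    ```sql
    -- How many times each dataset2 value was picked.
    SELECT column4, count(*) AS times_used
    FROM dataset1
    GROUP BY column4
    ORDER BY times_used DESC, column4 ;

    -- dataset2 values that were not picked at all.
    SELECT column1
    FROM dataset2
    WHERE NOT EXISTS
        (SELECT 1 FROM dataset1 WHERE dataset1.column4 = dataset2.column1) ;
    ```

    The results differ between runs, because each row of dataset1 draws its value independently.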

    dbfiddle here
