SQL random sample with groups

天涯浪人 2021-02-05 18:59

I have a university graduate database and would like to extract a random sample of data of around 1000 records.

I want to ensure the sample is representative of the pop

4 Answers
  • 2021-02-05 19:29

    I've done similar queries (but not on MS SQL) using a ROW_NUMBER approach:

    select ...
    from
     ( select ...
         ,row_number() over (partition by coursecode order by newid()) as rn
       from degree
     ) as d
    join sample_size as s   -- assumed table name: one row per coursecode with its samplesize
    on d.coursecode = s.coursecode
    and d.rn <= s.samplesize
    
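
The join above needs a per-course sample size from somewhere. As a rough sketch (not part of the original answer), the Python below shows one way those sizes could be derived: proportional allocation with largest-remainder rounding, so the sizes add up to the requested total. The function name `allocate_sample_sizes` and the toy course codes are assumptions made for illustration.

```python
from collections import Counter


def allocate_sample_sizes(course_codes, total_sample):
    """Proportional allocation with largest-remainder rounding.

    course_codes: one course code per record in the population.
    Returns {coursecode: samplesize}; the sizes add up to total_sample
    (or to the population size if that is smaller).
    """
    counts = Counter(course_codes)
    population = sum(counts.values())
    total_sample = min(total_sample, population)

    # Exact (fractional) share of the sample for each course code.
    exact = {code: total_sample * n / population for code, n in counts.items()}
    # Start from the floor of each share.
    sizes = {code: int(share) for code, share in exact.items()}

    # Hand out the rows lost to rounding, largest fractional part first.
    leftover = total_sample - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda c: (exact[c] - sizes[c], c), reverse=True)
    for code in by_remainder[:leftover]:
        sizes[code] += 1
    return sizes


# Hypothetical population: 600 MATH, 300 PHYS, 100 CHEM records.
codes = ["MATH"] * 600 + ["PHYS"] * 300 + ["CHEM"] * 100
sizes = allocate_sample_sizes(codes, 50)
```

For this toy population the allocation is 30, 15 and 5 rows. The resulting pairs could then be loaded into the sample-size table that the query joins against.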
  • 2021-02-05 19:34

    It is not necessary to partition the population at all.

    If you are taking a sample of 1000 from a population among hundreds of course codes, it stands to reason that many of those course codes will not be selected at all in any one sampling.

    If the sampling frame is uniform (say, a continuous sequence of student IDs with one row per student), a uniformly distributed sample will, on average, reflect the population's weighting by course code without any extra work. Ordering by newid() puts the rows in an effectively uniform random order, so taking the first 1000 rows works out of the box.

    The only wrinkle you might encounter is a student ID that is associated with multiple course codes. In that case, build a unique list (a temporary table or subquery) containing a sequential id, student id and course code, sample the sequential id from it, and group by student id to remove duplicates.

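
To make the unpartitioned approach concrete, here is a minimal Python sketch under stated assumptions: the data is a list of hypothetical `(student_id, coursecode)` rows, and duplicates are removed before sampling rather than after, which is a slight variation on the steps described above. The function name and the toy data are invented for illustration.

```python
import random
from collections import Counter


def simple_random_sample(enrolments, sample_size, seed=None):
    """Sample distinct students uniformly, ignoring course codes.

    enrolments: list of (student_id, coursecode) rows; a student may
    appear on several rows.
    """
    rng = random.Random(seed)
    # Collapse to one entry per student so multi-course students are not
    # over-represented, then draw without replacement.
    students = sorted({student_id for student_id, _ in enrolments})
    chosen = set(rng.sample(students, min(sample_size, len(students))))
    # Return every row that belongs to a chosen student.
    return [row for row in enrolments if row[0] in chosen]


# Hypothetical data: 10,000 students, 80% on "BSC" and 20% on "BA";
# every 50th student is additionally enrolled on "DIP".
enrolments = []
for sid in range(10_000):
    enrolments.append((sid, "BSC" if sid % 5 else "BA"))
    if sid % 50 == 0:
        enrolments.append((sid, "DIP"))

sample = simple_random_sample(enrolments, 1000, seed=42)
sampled_students = {sid for sid, _ in sample}
shares = Counter(code for _, code in sample if code != "DIP")
```

With 1000 students drawn, the BSC share should land close to the population's 80%, although it varies from draw to draw. Small course codes can still end up with no rows at all, which is the trade-off this answer accepts.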
  • 2021-02-05 19:35

    You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:

    select d.*
    from (select d.*,
                 row_number() over (order by coursecode, newid()) as seqnum,
                 count(*) over () as cnt
          from degree d
         ) d
    where seqnum % (cnt / 500) = 1;  -- integer division: assumes cnt is at least 1000
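
The every-nth idea can be sketched outside the database as well. The Python below is an illustration only, assuming rows are tuples and that a `key` function extracts the course code; `systematic_sample` is a hypothetical name. It sorts by course code with a random tie-break and then keeps every `step`-th row.

```python
import random


def systematic_sample(rows, target, key, seed=None):
    """Every-nth sample after sorting by stratum, random order within it."""
    rng = random.Random(seed)
    # Counterpart of ORDER BY coursecode, NEWID(): sort by stratum and
    # break ties with a random number.
    ordered = sorted(rows, key=lambda r: (key(r), rng.random()))
    # Integer division, as in the SQL; never let the step drop below 1.
    step = max(len(ordered) // target, 1)
    # Keep 0-based positions 0, step, 2*step, ... (seqnum 1, 1+step, ... in SQL).
    return ordered[::step]


# Hypothetical population: 7,000 rows on course "A", 3,000 on course "B".
rows = [(i, "A" if i < 7000 else "B") for i in range(10_000)]
sample = systematic_sample(rows, 500, key=lambda r: r[1], seed=1)
count_a = sum(1 for r in sample if r[1] == "A")
```

Here the step is 20, so the sample has 500 rows, 350 of them from "A" and 150 from "B". When the population is not a multiple of the target, the sample size is only approximately the target. Unlike the SQL modulo test, this sketch returns every row when the step is 1.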
    

    EDIT:

    You can also calculate the population size for each group "on the fly":

    select d.*
    from (select d.*,
                 row_number() over (partition by coursecode order by newid()) as seqnum,
                 count(*) over () as cnt,
                 count(*) over (partition by coursecode) as cc_cnt
          from degree d
         ) d
    where seqnum <= 500 * (cc_cnt * 1.0 / cnt)  -- <= so a group owed exactly k rows gets k
    
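
One reading of the on-the-fly allocation is "give each course code the floor of its proportional share". The Python sketch below follows that reading; it is an assumption about the intent, not a translation of the query, and `stratified_sample` plus the toy course codes are hypothetical. It also shows a side effect worth knowing about: rounding down can leave the total short and small groups empty.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, target, key, seed=None):
    """Take floor(target * group_size / population) random rows per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    population = len(rows)
    sample = []
    for code, members in groups.items():
        # Floor of the proportional share, capped at the group size.
        quota = min(target * len(members) // population, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 9,000 BIG, 990 MID and 10 TINY rows.
rows = [(i, "BIG") for i in range(9000)]
rows += [(i, "MID") for i in range(9000, 9990)]
rows += [(i, "TINY") for i in range(9990, 10_000)]

sample = stratified_sample(rows, 500, key=lambda r: r[1], seed=7)
per_course = Counter(code for _, code in sample)
```

With a target of 500 this yields 450 BIG rows, 49 MID rows and no TINY rows, 499 in total. If every course code must appear, a minimum quota of one row per group is a common adjustment.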
  • 2021-02-05 19:49

    Add a table (called population below) that stores the sample size for each course code, then join to it.

    I think the query should look like this:

    SELECT *
    FROM (
        SELECT id, coursecode,
               ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
        FROM degree
    ) t
    -- The WHERE clause discards unmatched rows anyway, so an inner join is equivalent.
    INNER JOIN population p ON t.coursecode = p.coursecode
    WHERE t.rn <= p.SampleSize
    
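
To try the join-against-a-quota-table pattern end to end without a SQL Server instance, the sketch below runs an equivalent query in SQLite from Python. Several things here are assumptions: SQLite's `random()` stands in for `NEWID()`, window functions need SQLite 3.25 or newer, and the table contents are made-up toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE degree (id INTEGER PRIMARY KEY, coursecode TEXT);
    CREATE TABLE population (coursecode TEXT PRIMARY KEY, SampleSize INTEGER);
""")

# Hypothetical data: 60 MATH rows, 30 PHYS rows, 10 CHEM rows.
rows = [("MATH",)] * 60 + [("PHYS",)] * 30 + [("CHEM",)] * 10
conn.executemany("INSERT INTO degree (coursecode) VALUES (?)", rows)
# Per-course quotas for a 10% sample.
conn.executemany("INSERT INTO population VALUES (?, ?)",
                 [("MATH", 6), ("PHYS", 3), ("CHEM", 1)])

# Number the rows of each course in random order, keep the first SampleSize.
query = """
    SELECT t.id, t.coursecode
    FROM (
        SELECT id, coursecode,
               ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY random()) AS rn
        FROM degree
    ) AS t
    JOIN population AS p ON t.coursecode = p.coursecode
    WHERE t.rn <= p.SampleSize
"""
sample = conn.execute(query).fetchall()

per_course = {}
for _id, code in sample:
    per_course[code] = per_course.get(code, 0) + 1
```

Each run picks different ids, but the number of rows per course always matches the quota table: 6 MATH, 3 PHYS and 1 CHEM in this toy setup.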