Random sampling from a large dataset

失恋的感觉 2021-01-13 09:06

There's a large database from which I have extracted a study population. For comparison purposes, I would like to select a control group that has similar characteristics.

1 Answer
  • 2021-01-13 09:33
    select
       T1.sex,
       T1.decades,
       T1.counts,
       T2.patid
    
    from (
    
       select 
          sex, 
          age/10 as decades,
          COUNT(*) as counts
       from (
    
          select  m.patid,
             m.sex,
             DATEPART(year,min(c.admitdate)) -m.yrdob as Age
          from members as m
          inner join claims as c on c.patid=m.PATID
          group by m.PATID, m.sex,m.yrdob
       )x 
       group by sex, Age/10
    ) as T1
    join (
       -- the random sampling happens here
       select top 50 -- this is the total number of people in our dataset
          patid
          ,sex
          ,decades
    
       from (
          select  m.patid,
             m.sex,
             (DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
          from members as m
          inner join claims as c on c.patid=m.PATID
          group by m.PATID, m.sex, m.yrdob
    
       ) as per_patient
          order by NEWID()
    ) as T2
    on T2.sex = T1.sex
    and T2.decades = T1.decades 
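
    If the goal is a control group spread evenly across the same sex and decade strata, one possible variation (not part of the query above) is to rank rows randomly within each stratum and keep a fixed number per group. This is only a sketch: it reuses the `members` and `claims` tables from the query, while `@per_stratum` is a hypothetical quota introduced here for illustration.

    ```sql
    -- Hypothetical quota: how many patients to keep per (sex, decade) stratum.
    DECLARE @per_stratum int = 5;

    WITH per_patient AS (
        -- One row per patient, with age in decades at first admission.
        SELECT m.patid,
               m.sex,
               (DATEPART(year, MIN(c.admitdate)) - m.yrdob) / 10 AS decades
        FROM members AS m
        INNER JOIN claims AS c ON c.patid = m.patid
        GROUP BY m.patid, m.sex, m.yrdob
    ),
    ranked AS (
        -- Number the patients in each stratum in a random order.
        SELECT patid,
               sex,
               decades,
               ROW_NUMBER() OVER (PARTITION BY sex, decades
                                  ORDER BY NEWID()) AS rn
        FROM per_patient
    )
    SELECT patid, sex, decades
    FROM ranked
    WHERE rn <= @per_stratum;  -- keep at most the quota from each stratum
    ```

    Because `NEWID()` is evaluated per row inside the window's `ORDER BY`, each run should pick different patients within a stratum. Strata with fewer patients than the quota simply return all of their rows.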
    

    EDIT: In a similar question I posted earlier, I found that my results were not actually random: they were just the TOP N rows. I had put ORDER BY NEWID() in the outermost query, which only shuffled the same result set. An answer to that question (now closed) showed that TOP has to be used together with ORDER BY NEWID() inside the subquery, at the commented line in the query above.
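
    The difference can be sketched with two minimal queries against the `members` table from the query above. They are simplified for illustration (no join to `claims`) and are not the original queries from the closed question.

    ```sql
    -- Shuffle only: TOP picks its 50 rows first (typically the same ones
    -- on every run), and the outer ORDER BY NEWID() just reorders them.
    SELECT s.patid
    FROM (SELECT TOP 50 patid FROM members) AS s
    ORDER BY NEWID();

    -- Random sample: every row gets a fresh NEWID(), the rows are sorted
    -- by it, and TOP then keeps the first 50 of that random ordering.
    SELECT TOP 50 patid
    FROM members
    ORDER BY NEWID();
    ```

    The second form sorts the whole input by a random key, so on a very large table it can be slow.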
