Sampling from Oracle, Need exact number of results (Sample Clause)

℡╲_俬逩灬. Posted on 2019-11-29 11:00:20

Borrowing jonearles' example table (the test1 definition further down), I see exactly the same thing (in 11gR2 on an OEL developer image), usually getting values for column a that are heavily skewed towards 1; with small sample sizes I can sometimes see none at all. With the extra randomisation/restriction step I mentioned in a comment:

select a, count(*) from (
    select * from test1 sample (1)
    order by dbms_random.value
)
where rownum < 101
group by a;

... with three runs I got:

         A   COUNT(*)
---------- ----------
         1         71
         2         29

         A   COUNT(*)
---------- ----------
         1        100

         A   COUNT(*)
---------- ----------
         1         64
         2         36

Yes, 100% really came back as 1 on the second run. The skewing itself seems to be rather random. I tried with the block modifier which seemed to make little difference, perhaps surprisingly - I might have thought it would get worse in this situation.
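A sketch of what the block variant of the first query might look like, assuming the same test1 table. SAMPLE BLOCK asks Oracle to pick whole blocks rather than individual rows; the exact statement used for the comparison above isn't shown, so treat this as an illustration only:

```sql
-- Same shape as the first query, but sampling whole blocks instead of rows.
-- Assumes the test1 table from jonearles' example.
select a, count(*) from (
    select * from test1 sample block (1)
    order by dbms_random.value
)
where rownum < 101
group by a;
```

Since every row in a chosen block is returned, results on a table this small can swing a lot from run to run.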

This is likely to be slower, certainly for small sample sizes, as it has to hit the entire table, but it does give me pretty even splits fairly consistently:

select a, count(*) from (
    select a, b from (
        select a, b, row_number() over (order by dbms_random.value) as rn
        from test1
    )
    where rn < 101
)
group by a;

With three runs I got:

         A   COUNT(*)
---------- ----------
         1         48
         2         52

         A   COUNT(*)
---------- ----------
         1         57
         2         43

         A   COUNT(*)
---------- ----------
         1         49
         2         51

... which looks a bit healthier. YMMV of course.
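If what you need is the rows themselves rather than the counts, the same idea can be written more compactly. A minimal sketch, assuming the same test1 table; the second form uses the row-limiting clause, which is only available from 12c onwards (so not on the 11gR2 instance used above):

```sql
-- Exactly 100 rows (or fewer if the table is smaller), chosen by a random sort.
-- Works on 11g: the ORDER BY must be inside the inline view, ROWNUM outside.
select a, b
from (
    select a, b
    from test1
    order by dbms_random.value
)
where rownum <= 100;

-- 12c and later only: the same thing with the row-limiting clause.
select a, b
from test1
order by dbms_random.value
fetch first 100 rows only;
```

Both still read and sort the whole table, so the cost caveat above applies.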


This Oracle article covers some sampling techniques, and you might want to evaluate the ora_hash approach as well, and the stratified version if your data spread and your requirements for 'representativeness' demand it.
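As a rough idea of what those two approaches can look like (a sketch against the same test1 table, not taken from the article; the bucket count, the seed 42 and the per-group size of 50 are made-up values for illustration):

```sql
-- ORA_HASH approach: hash each ROWID into buckets 0..99 and keep one bucket,
-- which gives roughly 1% of the rows. The row count is approximate, not exact,
-- and the result is repeatable for a given seed (third argument).
select a, count(*)
from test1
where ora_hash(rowid, 99, 42) = 0
group by a;

-- Stratified version: exactly 50 random rows per value of a
-- (fewer if a stratum has fewer than 50 rows).
select a, b
from (
    select a, b,
           row_number() over (partition by a order by dbms_random.value) as rn
    from test1
)
where rn <= 50;
```

Changing the seed in the first query gives a different, but again repeatable, subset.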

You can't trust SAMPLE to return a truly random set of rows from a table. The algorithm appears to be based on the physical properties of the table.

create table test1(a number, b char(2000));

--Insert 10K fat records.  A is always 1.
insert into test1 select 1, level from dual connect by level <= 10000;

--Insert 10K skinny records.  A is always 2.
insert into test1 select 2, null from dual connect by level <= 10000;

--Select about 10 rows.
select * from test1 sample (0.1) order by a;

Run the last query multiple times and you will almost never see any 2s. This may be an accurate sample if you measure by bytes, but not by rows.

This is an extreme example of skewed data, but I think it's enough to show that SAMPLE doesn't work the way the manual implies it should. As others have suggested, you'll probably want to ORDER BY DBMS_RANDOM.VALUE.
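To see how differently the two kinds of rows are laid out physically, a diagnostic along these lines may help. It is only a sketch, assuming the test1 table above; it shows the layout, it does not prove how SAMPLE picks rows internally:

```sql
-- Count rows and distinct data blocks for each value of a.
-- File number and block number together identify a block.
select a,
       count(*) as row_count,
       count(distinct dbms_rowid.rowid_relative_fno(rowid) || '.' ||
                      dbms_rowid.rowid_block_number(rowid)) as block_count
from test1
group by a;
```

The fat rows (a = 1) should be spread over far more blocks than the skinny ones, which is the kind of physical difference the skew seems to follow.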

I've been fiddling about with a similar question. First I worked out what the sample size should be for each stratum. In your case there is only one ('700064'). So in a WITH clause or a temp table I did this:

Select DEPTID, Count(*) SAMPLE_ONE 
FROM PS_LEDGER  Sample(1)
WHERE DEPTID = '700064' 
Group By DEPTID

This tells you how many records to expect in a 1% sample. Let's call that TABLE_1.

Then I did this:

Select 
Ceil (Rank() over (Partition by A.DEPTID Order by DBMS_RANDOM.VALUE)
            / (Select SAMPLE_ONE From TABLE_1)) STRATUM_GROUP
,A.*
FROM PS_LEDGER A
WHERE A.DEPTID = '700064'

Make that another table. What you then get is random sample sets, each approx. 1% in size.

So if your original table held 1000 records you would get 100 random sample sets with 10 items in each set.

You can then select one of these randomly to test.
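Picking one set at random could be sketched like this. TABLE_2 is a hypothetical name for the table built from the second query (the original doesn't name it); STRATUM_GROUP is the column computed there:

```sql
-- Pick one STRATUM_GROUP number at random and return its rows.
-- TABLE_2 is an assumed name for the table holding the grouped rows.
select *
from table_2
where stratum_group = (
    select stratum_group
    from (
        select stratum_group
        from (select distinct stratum_group from table_2)
        order by dbms_random.value
    )
    where rownum = 1
);
```

Note that the last group in a stratum can come out smaller than the others, so if you need an exact count you may want to exclude the highest group number.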

Not sure if I've explained this very well, but it worked for me. I had 168 strata set up on a table with over 10 million records, and it worked quite well.

If you want more explanation or can improve this please don't hesitate.

Regards
