I am trying to pull a random sample of a population from a PeopleSoft database. Searching online has led me to think that the SAMPLE clause of the SELECT statement may be a viable option for us, but I am having trouble understanding how the SAMPLE clause determines the number of rows returned. I have looked at the Oracle documentation found here: http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2065953
But that reference only covers the syntax used to create the sample. I need to understand how the sample percent determines the sample size returned. It seems like it applies a random number to the percent you ask for and then uses a seed number to count every "n" records. Our requirement is that we pull an exact number of samples (for example, 100), that they are randomly selected, and that they are representative of the entire table (or at least of the grouping of data we choose with filters).
In a population of 10200 items if I need a sample of approximately 100 items, I could use this statement:
SELECT * FROM PS_LEDGER SAMPLE(1) --1 % of my total population
WHERE DEPTID = '700064'
However, we need to pull an exact number of samples (in this case 100), so I could pick a sample percent that almost always returns more rows than I need and then trim the result down, i.e.:
SELECT Count(*) FROM PS_LEDGER SAMPLE(2.5) --this percent must always give > 100 items
WHERE DEPTID = '700064' and rownum < 101
My concern with doing that is that my sample would not uniformly represent the entire population. For example, if the sample function just pulls every Nth record after it creates its own randomly generated seed, then adding rownum < 101 will cut off all of the records chosen from the bottom of the table. What I am looking for is a way to pull exactly 100 records from the table that are randomly selected and fairly representative of the entire table. Please help!
Borrowing jonearles' example table, I see exactly the same thing (in 11gR2 on an OEL developer image), usually getting values for column a heavily skewed towards 1; with small sample sizes I can sometimes see none at all. With the extra randomisation/restriction step I mentioned in a comment:
select a, count(*) from (
select * from test1 sample (1)
order by dbms_random.value
)
where rownum < 101
group by a;
... with three runs I got:
         A   COUNT(*)
---------- ----------
         1         71
         2         29

         A   COUNT(*)
---------- ----------
         1        100

         A   COUNT(*)
---------- ----------
         1         64
         2         36
Yes, 100% really came back as 1 on the second run. The skewing itself seems to be rather random. I tried with the block modifier, which seemed to make little difference, perhaps surprisingly; I might have thought it would get worse in this situation.
This is likely to be slower, certainly for small sample sizes, as it has to hit the entire table, but it does give me pretty even splits fairly consistently:
select a, count(*) from (
select a, b from (
select a, b, row_number() over (order by dbms_random.value) as rn
from test1
)
where rn < 101
)
group by a;
With three runs I got:
         A   COUNT(*)
---------- ----------
         1         48
         2         52

         A   COUNT(*)
---------- ----------
         1         57
         2         43

         A   COUNT(*)
---------- ----------
         1         49
         2         51
... which looks a bit healthier. YMMV of course.
This Oracle article covers some sampling techniques, and you might want to evaluate the ora_hash approach as well, and the stratified version if your data spread and your requirements for 'representativeness' demand it.
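For what it's worth, here is a minimal sketch of how an ora_hash-based pick of exactly 100 rows might look against the test1 table. It is not taken from the article; the seed value 42 is an arbitrary assumption, and rowid is used as the hash input only because test1 has no key column. The result should stay the same between runs as long as the rowids do not change, which may or may not be what you want.

```sql
-- Sketch only: repeatable pseudo-random pick of exactly 100 rows.
-- Assumptions: seed 42 is arbitrary; rowid stands in for a real key.
select a, b
from (
  select a, b,
         -- rank rows by a seeded hash; rowid breaks any hash ties
         row_number() over (
           order by ora_hash(rowid, 4294967295, 42), rowid
         ) as rn
  from test1
)
where rn <= 100;
```

Changing the seed gives a different, but again repeatable, set of rows. Like the dbms_random version, this still has to read the whole table.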
You can't trust SAMPLE to return a truly random set of rows from a table. The algorithm appears to be based on the physical properties of the table.
create table test1(a number, b char(2000));
--Insert 10K fat records. A is always 1.
insert into test1 select 1, level from dual connect by level <= 10000;
--Insert 10K skinny records. A is always 2.
insert into test1 select 2, null from dual connect by level <= 10000;
--Select about 10 rows.
select * from test1 sample (0.1) order by a;
Run the last query multiple times and you will almost never see any 2s. This may be an accurate sample if you measure by bytes, but not by rows.
This is an extreme example of skewed data, but I think it's enough to show that SAMPLE doesn't work the way the manual implies it should. As others have suggested, you'll probably want to ORDER BY DBMS_RANDOM.VALUE.
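Applied to the table and filter from the question, the ORDER BY DBMS_RANDOM.VALUE suggestion might look roughly like this. Treat it as a sketch rather than tuned code: it sorts every row that matches the filter, so it could be slow on a large department.

```sql
-- Sketch only: exactly 100 randomly chosen rows for one DEPTID
-- (fewer if the department has fewer than 100 rows).
select *
from (
  select l.*
  from ps_ledger l
  where l.deptid = '700064'
  order by dbms_random.value   -- shuffle the filtered rows
)
where rownum <= 100;           -- keep the first 100 of the shuffle
```

The ROWNUM filter has to sit outside the ordered subquery; putting it in the same query block as the ORDER BY would cut the rows off before they are shuffled.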
I've been fiddling about with a similar question. First I set up what the sample size will be for the different strata. In your case there is only one ('700064'). So in a WITH clause or a temp table I did this:
Select DEPTID, Count(*) SAMPLE_ONE
FROM PS_LEDGER Sample(1)
WHERE DEPTID = '700064'
Group By DEPTID
This tells you how many records to expect in a 1% sample. Let's call that TABLE_1.
Then I did this:
Select
Ceil(Rank() over (Partition by A.DEPTID Order by DBMS_RANDOM.VALUE)
/ (Select T.SAMPLE_ONE From TABLE_1 T
Where T.DEPTID = A.DEPTID)) STRATUM_GROUP -- match each row to its own stratum size
,A.*
FROM PS_LEDGER A
Make that another table. What you get then is a set of random samples, each approximately 1% of the stratum in size.
So if your original table held 1000 records you would get 100 random sample sets with 10 items in each set.
You can then select one of these sets at random to test.
Not sure if I've explained this very well, but it worked for me. I had 168 strata set up on a table with over 10 million records, and it worked quite well.
If you want more explanation or can improve this please don't hesitate.
Regards
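If a fixed number of rows per stratum is all that is needed, a simpler variant of the same idea might be enough. This is only a sketch, not the method described in the answer above: the figure of 100 rows per DEPTID is an assumption taken from the question, and strata with fewer than 100 rows simply return everything they have.

```sql
-- Sketch only: up to 100 randomly chosen rows from every DEPTID stratum.
select *
from (
  select l.*,
         -- number the rows of each stratum in random order
         row_number() over (
           partition by l.deptid
           order by dbms_random.value
         ) as rn
  from ps_ledger l
)
where rn <= 100;
```

Note that the outer query also returns the helper column rn; list the wanted columns explicitly if that gets in the way.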
Source: https://stackoverflow.com/questions/16024737/sampling-from-oracle-need-exact-number-of-results-sample-clause