I need some assistance with splitting a large SAS dataset into smaller datasets.
Each month I\'ll have a dataset containing a few million records. This number will
A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:
data in_data;
do recid = 1 to 1.000001e7;
datavar = 1;
output;
end;
run;
data _null_;
if 0 then set in_data;
declare hash h_out();
h_out.defineKey('_n_');
h_out.defineData('recid','datavar');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.
Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:
data _null_;
if 0 then set in_data;
declare hash h_out(dataset:'in_data(obs=0)');
h_out.defineKey('recid');
h_out.defineData(all:'y');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
Since we can't use _n_
anymore for the hash ID due to using the dataset
option on the constructor (necessary for the all:'y'
functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.
Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.
* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;
data output2;
set df;
if 5 <= _N_ < 10 then output;
run;
data output3;
set df;
if 10 <= _N_ < 15 then output;
run;
data output4;
set df;
if 15 <= _N_ < 22 then output;
run;
The first result on Google is from the SAS User Group International (SUGI) These folks are your friends.
The article is here: http://www2.sas.com/proceedings/sugi27/p083-27.pdf
The code is:
%macro split(ndsn=2);
data %do i = 1 %to &ndsn.; dsn&i. %end; ;
retain x;
set orig nobs=nobs;
if _n_ eq 1
then do;
if mod(nobs,&ndsn.) eq 0
then x=int(nobs/&ndsn.);
else x=int(nobs/&ndsn.)+1;
end;
if _n_ le x then output dsn1;
%do i = 2 %to &ndsn.;
else if _n_ le (&i.*x)
then output dsn&i.;
%end;
run;
%mend split;
%split(ndsn=10);
All you need to do is replace the 10 digit in "%split(ndsn=10);" with the number you require. In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.
Hey presto!
You can do it without macros, using CALL EXECUTE(). It creates SAS-code as text strings and then executes it, after your "manually written" code completed.
data _null_;
if 0 then set have nobs=n;
do i=1 to ceil(n/250000);
call execute (cats("data want",i)||";");
call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
call execute ("run;");
end;
run;
You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.
data want1 want2 want3 want4 want5 want6 want7 want8 want9;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;
Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.
Building on Joe's answer, maybe you could try something like this :
%MACRO SPLIT(DATASET);
%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET__ID., NLOBS));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);
DATA
%DO I=1 %TO &NB_DATASETS.;
WANT&I.
%END;;
SET WANT;
%DO I=1 %TO &NB_DATASETS.;
%IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;
%END;
RUN;
%MEND SPLIT;