I need some assistance with splitting a large SAS dataset into smaller datasets.
Each month I'll have a dataset containing a few million records. This number will
A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:
/* Test data: 1.000001e7 = 10,000,010 records, so the last output file is a short one */
data in_data;
  do recid = 1 to 1.000001e7;
    datavar = 1;
    output;
  end;
run;
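data _null_;
  if 0 then set in_data;               /* never executes; sets up the variables at compile time */
  declare hash h_out();
  h_out.defineKey('_n_');              /* key is the position within the current chunk */
  h_out.defineData('recid','datavar');
  h_out.defineDone();
  do filenum = 1 by 1 until (eof);     /* one iteration per output dataset */
    do _n_ = 1 to 250000 until (eof);  /* read up to 250k records into the hash */
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));  /* writes file_1, file_2, ... */
    h_out.clear();                     /* empty the hash before the next chunk */
  end;
  stop;
run;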
We define a hash object keyed on _n_ (the inner loop counter), add each record to it, and after every 250k records write it out to a dataset and clear it. A hash-of-hashes would also work here, particularly if the split were driven by some other criterion rather than just "every 250k records", but then all of the records would have to fit in memory, not just 250k of them.
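For reference, a hash-of-hashes version might look roughly like the sketch below. It is untested, and it assumes the split is driven by a grouping variable; grp, h_grp and hoh are made-up names, and grp is derived from recid here only so there is something to split on. Every record stays in memory until the end, as noted above.

```sas
data _null_;
  if 0 then set in_data;
  declare hash h_grp;                    /* reference to one per-group hash (not instantiated yet) */
  declare hash hoh();                    /* outer hash: grp -> per-group hash object */
  hoh.defineKey('grp');
  hoh.defineData('grp','h_grp');
  hoh.defineDone();
  declare hiter hi('hoh');

  do until (eof);
    set in_data end=eof;
    grp = mod(recid, 3);                 /* hypothetical split criterion */
    if hoh.find() ne 0 then do;          /* first record for this grp: create its hash */
      h_grp = _new_ hash();
      h_grp.defineKey('recid');
      h_grp.defineData('recid','datavar','grp');
      h_grp.defineDone();
      hoh.add();
    end;
    h_grp.add();                         /* h_grp now points at the hash for this grp */
  end;

  do while (hi.next() = 0);              /* write one dataset per group */
    h_grp.output(dataset:cats('file_',grp));
  end;
  stop;
run;
```

With the criterion above this should produce file_0, file_1 and file_2. The chunked version only ever holds 250k records at a time, which is why it is the one shown first.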
Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:
data _null_;
  if 0 then set in_data;
  declare hash h_out(dataset:'in_data(obs=0)');  /* obs=0: take the variable list, load no rows */
  h_out.defineKey('recid');                      /* key must be a variable on the dataset */
  h_out.defineData(all:'y');                     /* every dataset variable becomes a data item */
  h_out.defineDone();
  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;
Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.
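If the real dataset has no ID, a view along these lines could supply one. This is a minimal sketch: raw_data and raw_data_v are hypothetical names standing in for your input dataset and the view, and recid is simply the row number.

```sas
/* raw_data is a hypothetical input with no unique ID variable */
data raw_data_v / view=raw_data_v;
  set raw_data;
  recid = _n_;   /* row number, assigned as the view is read */
run;
```

The view name would then replace in_data in both the hash constructor and the set statement of the step above. Because it is a view, the data is not copied to disk a second time.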