Split large SAS dataset into smaller datasets

长情又很酷 2021-01-13 01:10

I need some assistance with splitting a large SAS dataset into smaller datasets.

Each month I'll have a dataset containing a few million records. This number will

6 answers
  •  攒了一身酷
    2021-01-13 01:45

    If you have enough memory to hold one of the smaller datasets, a hash object solution is a more efficient option. Here's an example based on what you describe in the question:

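    /* Test data: 10,000,010 records, deliberately not a multiple of 250,000 */
    data in_data;
      do recid = 1 to 1.000001e7;
        datavar = 1;
        output;
      end;
    run;
    
    
    data _null_;
      if 0 then set in_data;                /* define recid and datavar without reading any rows */
      declare hash h_out();
      h_out.defineKey('_n_');               /* _n_ is the row's position within the current chunk */
      h_out.defineData('recid','datavar');
      h_out.defineDone();
    
      do filenum = 1 by 1 until (eof);      /* one pass per output dataset */
        do _n_ = 1 to 250000 until (eof);   /* load up to 250,000 records */
          set in_data end=eof;
          h_out.add();
        end;
        h_out.output(dataset:cats('file_',filenum));  /* writes file_1, file_2, ... */
        h_out.clear();                      /* empty the hash for the next chunk */
      end;
      stop;                                 /* all input was read inside the loops */
    run;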
    

    We define a hash object keyed on _n_, fill it with 250k records at a time, write it out as a dataset, and clear it before loading the next chunk. A hash-of-hashes would also work here, particularly if the split were driven by some other criterion rather than just "every 250k records", but then all of the records would have to fit in memory, not just 250k of them.
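
    As a rough sketch of the hash-of-hashes idea, assuming the split is driven by a made-up rule (here grp = mod(recid, 4), which is not from the question) and that all records fit in memory, it might look something like this. The names grp, hoh, h_sub, hi and the group_ prefix are illustrative only:

    ```sas
    data _null_;
      if 0 then set in_data;               /* define recid and datavar */
      length grp 8;                        /* hypothetical split criterion */
      declare hash h_sub;                  /* reference only; instantiated per group below */
      declare hash hoh();                  /* outer hash: one entry per group */
      hoh.defineKey('grp');
      hoh.defineData('grp','h_sub');
      hoh.defineDone();
      declare hiter hi('hoh');

      do _n_ = 1 by 1 until (eof);
        set in_data end=eof;
        grp = mod(recid, 4);               /* assumed rule: four groups by recid */
        if hoh.find() ne 0 then do;        /* first record seen for this group */
          h_sub = _new_ hash();
          h_sub.defineKey('_n_');
          h_sub.defineData('recid','datavar');
          h_sub.defineDone();
          hoh.add();
        end;
        h_sub.add();                       /* h_sub now refers to this group's hash */
      end;

      do while (hi.next() = 0);            /* write one dataset per group */
        h_sub.output(dataset:cats('group_',grp));
      end;
      stop;
    run;
    ```

    Unlike the chunked version, nothing is written until the whole input has been read, which is why memory becomes the limiting factor.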

    Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:

    data _null_;
      if 0 then set in_data;
      declare hash h_out(dataset:'in_data(obs=0)');  /* obs=0: take the variable list, load no rows */
      h_out.defineKey('recid');                      /* the key must now be a dataset variable */
      h_out.defineData(all:'y');                     /* every variable in in_data */
      h_out.defineDone();
    
      do filenum = 1 by 1 until (eof);
        do _n_ = 1 to 250000 until (eof);
          set in_data end=eof;
          h_out.add();
        end;
        h_out.output(dataset:cats('file_',filenum));
        h_out.clear();
      end;
      stop;
    run;
    

    Because we use the dataset option on the constructor (necessary for the all:'y' functionality), we can no longer use _n_ as the hash key, so the dataset needs a record ID of its own. Hopefully such a variable already exists; if not, one could be added with a view.
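
    If no ID exists, a minimal sketch of the view approach could look like the following. The names in_data_v and row_id are made up for illustration, and this assumes the hash constructor accepts a view with obs=0 the same way it accepts a dataset:

    ```sas
    /* A view that adds a row number without copying the data */
    data in_data_v / view=in_data_v;
      set in_data;
      row_id = _n_;                        /* hypothetical ID variable */
    run;

    data _null_;
      if 0 then set in_data_v;
      declare hash h_out(dataset:'in_data_v(obs=0)');
      h_out.defineKey('row_id');
      h_out.defineData(all:'y');
      h_out.defineDone();

      do filenum = 1 by 1 until (eof);
        do _n_ = 1 to 250000 until (eof);
          set in_data_v end=eof;           /* read through the view */
          h_out.add();
        end;
        h_out.output(dataset:cats('file_',filenum));
        h_out.clear();
      end;
      stop;
    run;
    ```

    Note that with all:'y' the row_id variable is written to each output dataset along with the original variables.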
