Is there an efficient way of transposing a huge table in SAS?

攒了一身酷 2021-01-23 17:42

I have a data set in SAS that I need to transpose. It has the form (id, date, type, value), and I need to convert it into (id, date, valueoftype1, valueoftype2, ...).

Is there any

3 answers
  •  闹比i (original poster)
    2021-01-23 18:16

    PROC TRANSPOSE will do this very efficiently; I'd venture to say as well as or better than the most efficient method of any other DBMS out there. Your data is also already well organized for it. You just need to sort by ID and DATE, unless you already have an index on that combination (which, if you have billions of records, is a necessity IMO). No other solution will come close unless you have enough memory to hold it all in memory, which would be rather insane for a dataset that size: even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1-byte ID; I'd guess 25-30 GB or more.

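    /* PROC TRANSPOSE with a BY statement needs the data ordered (or indexed) by id, date */
    proc sort data=one;
      by id date;
    run;
    /* One output row per id/date; each distinct TYPE becomes a column holding VALUE */
    proc transpose data=one out=want;
      by id date;
      id type;
      var value;
    run;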
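
    If you would rather not physically sort, a composite index on (id, date) also lets BY-group processing read the rows in order. A minimal sketch, assuming the dataset is WORK.ONE as above; the index name id_date is just an illustrative choice, and whether reading through an index beats a sort on your data is something to measure rather than assume:

    ```sas
    /* Build a composite index on (id, date) instead of sorting.      */
    /* The index name id_date is an arbitrary, illustrative choice.   */
    proc datasets library=work nolist;
      modify one;
      index create id_date = (id date);
    quit;

    /* BY-group processing can use the index when the data is unsorted. */
    proc transpose data=one out=want;
      by id date;
      id type;
      var value;
    run;
    ```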
    

    A naive test on my system, with the following:

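    /* The timings below were taken with compression turned on */
    options compress=yes;
    /* Test data: 1e6 ids x 731 dates, one random type (A-Z) and value (1-20) per row */
    data one;
      do id = 1 to 1e6;
        do date = '01JAN2010'd to '01JAN2012'd;
          type = byte(ceil(ranuni(7)*26)+64);  /* ASCII 65-90, i.e. 'A'-'Z' */
          value = ceil(ranuni(7)*20);          /* integer from 1 to 20 */
          output;
        end;
      end;
    run;
    proc sort data=one;
      by id date;
    run;
    /* Wide result: one row per id/date, one column per type */
    proc transpose data=one out=want;
      by id date;
      id type;
      var value;
    run;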
    

    That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, 11 minutes to sort, and 45 minutes to PROC TRANSPOSE, which wrote a ~100GB compressed file. I'd guess that's about the best you can do: of those 45 minutes, over 20 were likely spent just writing the output (a 5x bigger dataset will take over 5x the time to write out, plus compression overhead). I was also doing other things at the time, so the timings were probably inflated somewhat, since SAS didn't get my entire processor (this is my desktop, a 4-core i5). I don't think this is an unreasonable processing time at all.
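
    To separate CPU time from elapsed (real) time for each step on your own system, one option is the FULLSTIMER system option, which makes SAS write both to the log after every step. A sketch, assuming the same dataset names as above:

    ```sas
    /* Log real time, CPU time and memory for every step; compress output datasets. */
    options fullstimer compress=yes;

    proc sort data=one;
      by id date;
    run;

    proc transpose data=one out=want;
      by id date;
      id type;
      var value;
    run;
    ```

    Comparing real time with CPU time in the log gives a rough idea of how much of each step is spent waiting on I/O rather than computing.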

    You might also reconsider whether a transpose is really what you need: do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis, etc.) without transposing the entire dataset.
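
    For instance, if the end goal were per-type statistics (purely a hypothetical goal, since the question doesn't state one), those can be computed straight from the long table without ever building the wide one. A sketch, where type_summary, n_obs and mean_value are made-up names:

    ```sas
    /* Per-type counts and means straight from the long layout; no transpose. */
    proc means data=one noprint nway;
      class type;
      var value;
      output out=type_summary (drop=_type_ _freq_)
             n=n_obs mean=mean_value;
    run;
    ```

    Likewise, if only some ids or dates are needed in wide form, a WHERE= dataset option on the PROC TRANSPOSE input, e.g. data=one(where=(id <= 1000)), keeps the transposed output small.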
