Is there an efficient way of transposing huge table in SAS

前端 未结 3 408
攒了一身酷
攒了一身酷 2021-01-23 17:42

I have a data set in SAS that I need to transpose. It has the form id date type value and I need to convert it into id date valueoftype1 valueoftype2 ...

Is there any

相关标签:
3条回答
  • 2021-01-23 18:16

    PROC TRANSPOSE will do this very, very efficiently, I'd venture to say equal to or better than the most efficient method of any other DBMS out there. Your data is already beautifully organized for that method, also. You just need a sort by ID DATE, unless you already have an index for that combination (which if you have billions of records is a necessity IMO). No other solution will come close, unless you have enough memory to put it all in memory - which would be rather insane for that size dataset (even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1 byte ID; i'd guess 25-30 GB or more.)

    proc sort data=one;
    by id date;
    run;
    proc transpose data=one out=want;
    by id date;
    id type;
    var value;
    run;
    

    A naive test on my system, with the following:

    data one; 
    do id = 1 to 1e6;
      do date = '01JAN2010'd to '01JAN2012'd;
        type = byte(ceil(ranuni(7)*26)+64);
        value = ceil(ranuni(7)*20);
        output;
      end;
    end;
    run;
    proc sort data=one;
    by id date;
    run;
    proc transpose data=one out=want;
    by id date;
    id type;
    var value;
    run;
    

    That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, took 11 minutes to sort, and took 45 minutes to PROC TRANSPOSE, writing a ~100GB compressed file. I'd guess that's the best you can do; of those 45 minutes, over 20 were likely writing out (5x bigger dataset will take over 5x the time to write out, plus compression overhead); I was also doing other things at the time, so the CPU time was probably inflated some as it didn't get my entire processor (this is my desktop, a 4 core i5). I don't think this is particularly unreasonable processing time at all.

    You might consider looking at your needs, and perhaps a transpose isn't really what you want - do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis/etc.) without transposing the entire dataset.

    0 讨论(0)
  • 2021-01-23 18:19
    if first.date then a=.;b=.;c=.;d=.; 
    

    has to be replaced with:

    if first.date then do;
        a=.;b=.;c=.;d=.;
    end;
    

    or

    if first.date then call missing(a,b,c,d);
    

    Also instead of

    if last.date then do; output; a=.;b=.;c=.;d=.; end;
    

    now, it should be enough :

    if last.date then output;
    

    I guess a datastep will always be more efficient then PROC TRANSPOSE on big data. The limit is you have to find out of distinct values of the transposed variable and create new variables for those. I think that's the overhead of PROC TRANSPOSE - it first finds out the values. (I'm sorry I edited your own answer, so it might not be clear now what was the problem.)

    0 讨论(0)
  • 2021-01-23 18:37

    An alternative data step approach (DOW-loop):

    proc sort data = one;
      by ID date;
    run;
    
    data two;
      do _n_ = 1 by 1 until(last.date);
        set one;
        by ID DATE;
        if type = "A" then valA = value;
        else if type = "B" then valB = value;
        else if type = "C" then valC = value;
      end;
      drop value;
    run;
    

    On my system, using a dataset 1/10 of the size of the one Joe used, it took 2 minutes to sort and 9 minutes 40 seconds to use proc transpose. The DOW loop did the same thing in 7 minutes 4 seconds. In this particular scenario, it isn't terribly impressive, but it has one big advantage over proc transpose: you can use it to transpose multiple variables in a single pass. Here's the code I used:

    data one; 
    do id = 1 to 1e5;
      do date = '01JAN2010'd to '01JAN2012'd;
        type = byte(ceil(ranuni(7)*26)+64);
        value = ceil(ranuni(7)*20);
        output;
      end;
    end;
    run;
    
    data two;
      do _n_ = 1 by 1 until(last.DATE);
        set one;
        array vals[26] val65-val90;
        by ID DATE;
        do i = 1 to 26;
          if type = byte(64 + i) then vals[i] = value;
        end;
      end;
      drop value i;
    run;
    

    Renaming all 26 transposed type variables dynamically is a bit tricky, but this can be done via call execute:

    data _null_;
      call execute('proc datasets lib = work nolist;');
      call execute('modify two;');
      call execute('rename');
      do i = 1 to 26;
        call execute(compress('val' || i + 64) || ' = ' || compress('val' || byte(64+i)));
      end;
      call execute(';');
      call execute('run;');
      call execute('quit;');
    run;
    
    0 讨论(0)
提交回复
热议问题