Partitioned Running Totals in SAS

后端 未结 4 452
忘掉有多难
忘掉有多难 2021-01-24 17:51

How can I create a new column for an existing dataset that is the running total of an existing column - partitioned by some identifier?

ID |  Value |     New Val         


        
4条回答
  •  再見小時候
    2021-01-24 18:45

    I generally do running totals in SAS using a data step rather than proc sql. You could theoretically do that with proc sql and a cross-join, but that's not very practical in the real-world.

    Step 1: Sort the data by your ID value

    proc sort data=have;
         by ID;
    run;
    

    Step 2: Calculate the running total using a sum statement

    data want;
        set have;
        by ID;
    
        New_Value+Value;
    
        output;
    
        if(last.ID) then New_Value = 0;
     run;
    

    Explanation

    By-group processing is one of SAS's most powerful tools. When you sort by your ID value and use a by-statement on it in a data step, you unlock two new special boolean variables: first. and last.. SAS knows exactly when a group of ID variables starts and stops when you do by-group processing (you can actually use a special option called notsorted to get away without sorting the ID values, but that is an advanced concept). In your case, let's look and see how this works:

    ID Value first.ID last.ID
    1  10    1        0
    1  5     0        0
    1  3     0        1
    2  45    1        0 
    2  15    0        1
    

    If there was only one observation in an ID group, first.ID and last.ID would both be 1.

    There is a special statement in SAS called a sum statement that does the job of both retaining and summing up a variable. SAS is inherently a looping language; whenever it runs through the data step, it is looking at one record and one record only. When it hits the bottom of the data step and goes back up to the top, begins to read the next record in the table, and assumes all variables are now missing again until it reads or calculates the variables' value. The thing that you are literally outputting are the contents of something called the Program Data Vector. In the background, this is what you are manipulating.

    By default, it will output to a data set only after it hits the run boundary. If you explicitly tell SAS to output, it will output only when you tell it to. To cumulatively sum up a variable, we want to tell SAS four things:

    1. Retain the previous value of a variable (don't reset it to missing on a when reading a new record)
    2. Add the variable to itself
    3. Output after I add the value to itself
    4. After I output and it's the last observation in a group of ID variables, reset my cumulative variable to 0

    To achieve Step 1 & 2, you can do two options:

     data want;
          set have;
          retain New_Value;
    
          New_Value = sum(New_Value, Value);
    

    OR

     data want;
           set have;
           New_Value+Value;
    

    Note that the 2nd option does exactly what the first does but with less work. Think of it as a shortcut. This is called a Sum Statement. It looks like it's syntactically wrong, but it's a special, and very useful, case.

    To achieve step 3, we simply tell SAS to output right afterwards rather than at the end of the data step by default:

    data want;
        set have;
    
        New_Value+Value;
    
        output;
    

    If you run the code above as it is, New_Value will add up indefinitely until it reaches the very end of the file. We want to reset this value once it reaches a new ID group. We use by-group processing to unlock those two boolean variables to reset New_Value to 0 during a specified condition after we output:

    data want;
        set have;
        by ID;
    
        New_Value+Value;
    
        output;
    
        if(last.ID) then New_Value = 0;
    run;
    

    New_Value will not be reset to 0 unless we are in the very last of observation of the specified ID group. Note that we put the conditional if statement below the output statement. If this were above, you would see the following phenomenon:

    ID Value New_Value first.ID last.ID
    1  10    10         1        0
    1  5     15         0        0
    1  3      0         0        1
    2  45    45         1        0 
    2  15     0         0        1
    

    We want to output the cumulative sum before New_Value is reset to 0.

    There are some other SAS procedures that you can use to do similar things, but they are designed for specific situations. You can repurpose them to do what you want in this case, but it's better to learn data step processing before jumping into procedure repurposing.

提交回复
热议问题