How can I create a new column for an existing dataset that is the running total of an existing column - partitioned by some identifier?
ID | Value | New Val
I generally do running totals in SAS using a data
step rather than proc sql
. You could theoretically do that with proc sql
and a cross-join, but that's not very practical in the real-world.
Step 1: Sort the data by your ID value
proc sort data=have;
by ID;
run;
Step 2: Calculate the running total using a sum statement
data want;
set have;
by ID;
New_Value+Value;
output;
if(last.ID) then New_Value = 0;
run;
Explanation
By-group processing is one of SAS's most powerful tools. When you sort by your ID value and use a by-statement on it in a data step, you unlock two new special boolean variables: first.
and last.
SAS knows exactly when a group of ID variables starts and stops when you do by-group processing (you can actually use a special option called notsorted
to get away without sorting the ID values, but that is an advanced concept). In your case, let's look and see how this works:
ID Value first.ID last.ID
1 10 1 0
1 5 0 0
1 3 0 1
2 45 1 0
2 15 0 1
If there was only one observation in an ID group, first.ID
and last.ID
would both be 1.
There is a special statement in SAS called a sum statement that does the job of both retaining and summing up a variable. SAS is inherently a looping language; whenever it runs through the data
step, it is looking at one record and one record only. When it hits the bottom of the data
step and goes back up to the top, begins to read the next record in the table, and assumes all variables are now missing again until it reads or calculates the variables' value. The thing that you are literally outputting are the contents of something called the Program Data Vector. In the background, this is what you are manipulating.
By default, it will output to a data set only after it hits the run
boundary. If you explicitly tell SAS to output, it will output only when you tell it to. To cumulatively sum up a variable, we want to tell SAS four things:
To achieve Step 1 & 2, you can do two options:
data want;
set have;
retain New_Value;
New_Value = sum(New_Value, Value);
OR
data want;
set have;
New_Value+Value;
Note that the 2nd option does exactly what the first does but with less work. Think of it as a shortcut. This is called a Sum Statement. It looks like it's syntactically wrong, but it's a special, and very useful, case.
To achieve step 3, we simply tell SAS to output right afterwards rather than at the end of the data step by default:
data want;
set have;
New_Value+Value;
output;
If you run the code above as it is, New_Value
will add up indefinitely until it reaches the very end of the file. We want to reset this value once it reaches a new ID group. We use by-group processing to unlock those two boolean variables to reset New_Value
to 0 during a specified condition after we output:
data want;
set have;
by ID;
New_Value+Value;
output;
if(last.ID) then New_Value = 0;
run;
New_Value
will not be reset to 0 unless we are in the very last of observation of the specified ID group. Note that we put the conditional if
statement below the output statement. If this were above, you would see the following phenomenon:
ID Value New_Value first.ID last.ID
1 10 10 1 0
1 5 15 0 0
1 3 0 0 1
2 45 45 1 0
2 15 0 0 1
We want to output the cumulative sum before New_Value
is reset to 0.
There are some other SAS procedures that you can use to do similar things, but they are designed for specific situations. You can repurpose them to do what you want in this case, but it's better to learn data
step processing before jumping into procedure repurposing.