How can I create a new column for an existing dataset that is the running total of an existing column - partitioned by some identifier?
ID | Value | New Val
Joe - your answer did not work for whatever reason, but got me on the right track to figure it out. Thank you!
data want;
set have;
by id;
if first.id then running_total = 0;
if first.id then retained_total = 0;
running_total = retained_total + value;
retained_total = running_total;
retain retained_total;
run;
Here's another option:
data want;
do until (last.id);
set have;
by id;
new_value + value;
output;
end;
new_value = 0;
run;
I generally do running totals in SAS using a data
step rather than proc sql
. You could theoretically do that with proc sql
and a cross-join, but that's not very practical in the real-world.
Step 1: Sort the data by your ID value
proc sort data=have;
by ID;
run;
Step 2: Calculate the running total using a sum statement
data want;
set have;
by ID;
New_Value+Value;
output;
if(last.ID) then New_Value = 0;
run;
Explanation
By-group processing is one of SAS's most powerful tools. When you sort by your ID value and use a by-statement on it in a data step, you unlock two new special boolean variables: first.<by-group variable>
and last.<by-group variable>.
SAS knows exactly when a group of ID variables starts and stops when you do by-group processing (you can actually use a special option called notsorted
to get away without sorting the ID values, but that is an advanced concept). In your case, let's look and see how this works:
ID Value first.ID last.ID
1 10 1 0
1 5 0 0
1 3 0 1
2 45 1 0
2 15 0 1
If there was only one observation in an ID group, first.ID
and last.ID
would both be 1.
There is a special statement in SAS called a sum statement that does the job of both retaining and summing up a variable. SAS is inherently a looping language; whenever it runs through the data
step, it is looking at one record and one record only. When it hits the bottom of the data
step and goes back up to the top, begins to read the next record in the table, and assumes all variables are now missing again until it reads or calculates the variables' value. The thing that you are literally outputting are the contents of something called the Program Data Vector. In the background, this is what you are manipulating.
By default, it will output to a data set only after it hits the run
boundary. If you explicitly tell SAS to output, it will output only when you tell it to. To cumulatively sum up a variable, we want to tell SAS four things:
To achieve Step 1 & 2, you can do two options:
data want;
set have;
retain New_Value;
New_Value = sum(New_Value, Value);
OR
data want;
set have;
New_Value+Value;
Note that the 2nd option does exactly what the first does but with less work. Think of it as a shortcut. This is called a Sum Statement. It looks like it's syntactically wrong, but it's a special, and very useful, case.
To achieve step 3, we simply tell SAS to output right afterwards rather than at the end of the data step by default:
data want;
set have;
New_Value+Value;
output;
If you run the code above as it is, New_Value
will add up indefinitely until it reaches the very end of the file. We want to reset this value once it reaches a new ID group. We use by-group processing to unlock those two boolean variables to reset New_Value
to 0 during a specified condition after we output:
data want;
set have;
by ID;
New_Value+Value;
output;
if(last.ID) then New_Value = 0;
run;
New_Value
will not be reset to 0 unless we are in the very last of observation of the specified ID group. Note that we put the conditional if
statement below the output statement. If this were above, you would see the following phenomenon:
ID Value New_Value first.ID last.ID
1 10 10 1 0
1 5 15 0 0
1 3 0 0 1
2 45 45 1 0
2 15 0 0 1
We want to output the cumulative sum before New_Value
is reset to 0.
There are some other SAS procedures that you can use to do similar things, but they are designed for specific situations. You can repurpose them to do what you want in this case, but it's better to learn data
step processing before jumping into procedure repurposing.
In the data step this is accomplished with the sum statement
.
data want;
set have;
by id;
if first.id then running_total=0;
running_Total + value;
run;
In PROC SQL this would be impossible unless you have an ordering variable (in which you could do something like this):
proc sql;
create table want as
select id, value,
(select sum(value) from have V
where V.id=H.id and V.ordervar le H.ordervar
) as running_total
from have H
;
quit;
But SAS doesn't have a partition by
concept - the SAS data step is far more powerful than that.