I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However,
@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.
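library(data.table)
library(zoo)
setDT(bill)
# Pass 1: carry the last known referring doctor forward within each patient
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
# Pass 2: carry the next known value backward to fill any leading NA
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE, fromLast = TRUE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]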
Then do something similar for referring.doctor.first.
A few pointers:
The by argument ensures that the last observation carried forward is restricted to the same patient, so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE argument. Otherwise, a patient who is missing the referring physician on their very first visit will have that leading NA removed, and the vector of new values (existing + carried forward) will be one element short of the number of rows. The shortened vector is then recycled: everything gets shifted up and the last row gets the first element of the vector. In other words, a big mess. And worst of all, you will only see it sometimes.
Use fromLast=TRUE to run through the column again. That fills in any NA that preceded the first recorded value. With this argument, instead of last observation carried forward (LOCF), zoo performs next observation carried backward (NOCB). Happiness: you have now filled in the missing data in a way that is correct for most circumstances.
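The pointers about na.rm and fromLast can be checked on a single vector. This is a minimal sketch, assuming a made-up toy vector standing in for one patient's referring doctor over four visits; only zoo's na.locf and its na.rm and fromLast arguments are used.

```r
library(zoo)

# Hypothetical toy data: one patient's referring doctor across four visits.
x <- c(NA, "Smith", NA, "Jones")

# Default na.rm = TRUE drops the leading NA, so the result is one element short.
short <- na.locf(x)
length(short) < length(x)

# na.rm = FALSE keeps the length; the leading NA is left in place for now.
fwd <- na.locf(x, na.rm = FALSE)

# A second pass with fromLast = TRUE fills the leading NA from the next value.
both <- na.locf(fwd, na.rm = FALSE, fromLast = TRUE)
both
```

The length mismatch in the first call is what causes trouble once the result is assigned back into a column by group.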
You can pass multiple := assignments in one call, e.g. DT[,`:=`(new=1L,new2=2L,...)]
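Applied to this problem, the functional form of := could fill both name columns in a single grouped call. The sketch below uses a small made-up table; the column names follow the answer above, while the values and the choice to group by medical.record.nr alone are assumptions for illustration.

```r
library(data.table)
library(zoo)

# Hypothetical toy billing table (values invented for illustration).
bill <- data.table(
  medical.record.nr      = c(1L, 1L, 1L, 2L, 2L),
  referring.doctor.last  = c(NA, "Smith", NA, NA, "Jones"),
  referring.doctor.first = c(NA, "Ann", NA, NA, "Bob")
)

# One grouped call, two columns: the inner na.locf carries forward,
# the outer one (fromLast = TRUE) carries backward to fill leading NAs.
bill[, `:=`(
  referring.doctor.last = na.locf(
    na.locf(referring.doctor.last, na.rm = FALSE),
    na.rm = FALSE, fromLast = TRUE),
  referring.doctor.first = na.locf(
    na.locf(referring.doctor.first, na.rm = FALSE),
    na.rm = FALSE, fromLast = TRUE)
), by = medical.record.nr]
```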
A more concise example would have been easier to answer. For example, you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?
Since you already have NAs in the data that you wish to fill, it's not really a job for roll in data.table. A rolling join is more for when your data has no NA but you have another time series (for example) that joins to positions in between the data. (One efficiency advantage there is the very fact that you don't create NA first, which you then have to fill in a second step.) Or, in other words, in your question you have just one dataset; you aren't joining two.
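For contrast, here is a rough sketch of the situation a rolling join is meant for: two separate tables, neither containing NA, where rows of one fall between the dates of the other. The table names (referrals, bills) and all values are hypothetical.

```r
library(data.table)

# Hypothetical table of known referrals: no NA anywhere.
referrals <- data.table(
  patient = 1L,
  date    = as.Date(c("2010-01-01", "2010-06-01")),
  doctor  = c("A", "B")
)

# Hypothetical second table whose dates fall in between the referral dates.
bills <- data.table(
  patient = 1L,
  date    = as.Date(c("2010-02-15", "2010-07-01"))
)

# roll = TRUE picks, for each bill, the most recent referral on or before its date.
res <- referrals[bills, on = c("patient", "date"), roll = TRUE]
res
```

Here the carry-forward happens during the join itself, so no NA is ever created and then filled.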
So you do need na.locf, as @Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.
In data.table, to use na.locf by group it's just:
require(data.table)
require(zoo)
# na.rm=FALSE keeps a leading NA so the result matches the group's row count
DT[,doctor:=na.locf(doctor,na.rm=FALSE),by=patient]
which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non-NA value backwards.
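One possible shape for such a small function is sketched below. The name fill_forward_then_back is hypothetical (it is not part of zoo or data.table), and the toy table is invented; patient 3 has no recorded doctor at all, to show that such a group simply stays NA.

```r
library(data.table)
library(zoo)

# Hypothetical helper: carry forward first, then roll the first
# non-NA value backwards over any leading NAs.
fill_forward_then_back <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}

# Hypothetical toy data.
DT <- data.table(
  patient = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  doctor  = c(NA, "A", NA, "B", NA, NA, NA)
)

DT[, doctor := fill_forward_then_back(doctor), by = patient]
DT
```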
Ensure the data is sorted by patient, then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.
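As an illustration of why the sort matters, the sketch below orders a deliberately shuffled toy table with setorder before filling. The column names (patient, date, doctor) and values are assumptions; setorder is data.table's by-reference sort.

```r
library(data.table)
library(zoo)

# Hypothetical toy data, deliberately out of order.
DT <- data.table(
  patient = c(2L, 1L, 1L, 2L, 1L),
  date    = as.Date(c("2010-03-01", "2010-02-01", "2010-01-01",
                      "2010-01-15", "2010-03-01")),
  doctor  = c(NA, NA, "A", "B", "C")
)

# Sort by patient, then date, so "last observation" means the previous visit.
setorder(DT, patient, date)

# Patient 1 changes doctor from A to C; the gap in between is filled with A.
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]
DT
```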
Hope that gives you some hints.