I previously worked with SAS and then decided to shift to R for academic reasons. My data (healthdemo) are health data containing some health diagnostic co
I had a similar struggle when I transitioned from SAS to R for health-related research. My solution was to let go of the "if...then" approach as much as possible and take advantage of some of R's native programming capabilities. Here are two approaches to your problem.
First, you can use indexing to find and replace elements. Here is some hospital discharge data of the kind you describe:
hosp<-read.csv(file="http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/resources/R/sparcsShort.csv",stringsAsFactors=F)
head(hosp)
Say I want to identify every birth-related diagnosis in Manhattan. I first create a logical vector that returns a series of TRUES and FALSES for my search criteria, then I index my data frame by that logical vector. In this case I am also restricting the columns or variables I want returned:
myObs<-hosp$county==59 & hosp$pdx=="V3000 " #note space
myVars<-c("age", "sex", "disp")
myFile<-hosp[myObs,myVars]
head(myFile)
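The same indexing idea can be tried without downloading anything. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the real discharge file) and also shows `trimws()` as one way to avoid typing the trailing space in the code:

```r
# Toy stand-in for the hospital data; all values are invented for illustration.
toy <- data.frame(
  county = c(59, 59, 60, 59),
  pdx    = c("V3000 ", "4280  ", "V3000 ", "V3000 "),
  age    = c(0, 71, 0, 0),
  sex    = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE only where both conditions hold
keep <- toy$county == 59 & toy$pdx == "V3000 "

# Rows are chosen by the logical vector, columns by name
subset_rows <- toy[keep, c("age", "sex")]

# trimws() strips the padding, so the comparison no longer depends on the space
keep_trimmed <- toy$county == 59 & trimws(toy$pdx) == "V3000"
```

Both `keep` and `keep_trimmed` select the same rows here; the trimmed version is simply less fragile if the padding differs between files.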
The second, and perhaps more computationally elegant, approach is to use a function like grep. Say you're interested in identifying all substance abuse diagnoses, e.g. alcohol abuse (291, 303, 305 and sub-codes), opioids, cannabis, amphetamines, hallucinogens, and cocaine (304 and related sub-codes), or non-specific substance abuse-related diagnoses (292). In SAS you would write out a long if-then statement (or a more efficient array) of some kind:
#/*********************** SUBSTANCE ABUSE *****************/
#if pdx in /* use ICD9 codes to create diagnoses */ ('2910','2911','2912','2913','2914','2915',
# '29181','29189','2919','2920','29211','29212','2922','29281','29282','29283',
# ........etc...., '30592','30593')
#Then subst_ab=1;
#Else subst_ab=0;
In R, you can instead write:
# match any code that begins with 291, 292, 303, 304 or 305 followed by a digit
substance<-grep("^(291|292|303|304|305)[0-9]", hosp$pdx)
hosp$pdx[substance]
hosp$subsAb<-"No"
hosp$subsAb[substance]<-"Yes"
hosp$subsAb[1:100]
table(hosp$subsAb)
plot(table(hosp$subsAb))
library(ggplot2)
qplot(subsAb, age,data=hosp, alpha = I(1/50))
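As a variation on the grep approach above, `grepl()` returns a logical vector rather than positions, which combines naturally with `ifelse()`. This is only a sketch on made-up codes (the vector `pdx` below is an assumption, not the real data):

```r
# Made-up ICD-9 principal diagnosis codes, for illustration only
pdx <- c("2910", "30500", "4280", "V3000", "30421", "2920")

# grepl() gives one TRUE/FALSE per element; grep() would give row positions
is_subst <- grepl("^(291|292|303|304|305)", pdx)

# Build the Yes/No flag in a single vectorized step
subs_ab <- ifelse(is_subst, "Yes", "No")

table(subs_ab)
```

The logical vector can also be used directly for indexing, e.g. `pdx[is_subst]`.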
Tomas Aragon has written a wonderful introduction to R for epidemiologists that goes into these approaches in detail. (http://www.medepi.net/docs/ph251d_fall2012_epir-chap01-04.pdf)
I created the icd package to solve this kind of problem. You can use standard groups of diseases, or create your own. It can then quickly plough through all your codes and assign disease groups to each patient. It works with ICD-9 and ICD-10 codes.
I found plain text processing (like grep in the previous answer) was both slow and unreliable. ICD codes have numerous variations in how they are recorded, e.g. an ICD-9 code like X91.9 is equivalent to 0919. String processing for hundreds of thousands of rows was far too slow for me, even using R functions efficiently, so I wrote the package using a lot of C++; users with bigger data can assign comorbidities to a million patients in a couple of seconds. Hope this helps.
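To illustrate why recording variations matter for plain text matching, here is a minimal base R sketch that strips padding and decimal points before comparing codes. It is not how the icd package works internally, and `normalize_code` is a hypothetical helper name used only for this example:

```r
# Hypothetical helper: bring differently recorded codes to one plain form.
# It only handles padding, case and the decimal point; it does not try to
# resolve ambiguous short codes or missing leading zeros.
normalize_code <- function(x) {
  x <- toupper(trimws(x))          # drop padding, unify case
  gsub(".", "", x, fixed = TRUE)   # remove the decimal point
}

# Made-up examples of the same kinds of codes recorded in different ways
codes <- c("291.0", "2910 ", " 305.00", "v30.00")
clean <- normalize_code(codes)
```

After normalization the first two entries compare equal, so a single pattern or lookup table can be applied to all of them.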
I suppose the problem is due to icd_num not being numeric.
Use the following command to create this variable:
healthdemo$icd_num <- as.numeric(substr(healthdemo$ICDCODE, 2,
nchar(healthdemo$ICDCODE)))
(If you want to get rid of the numbers after the ".", replace as.numeric with as.integer.)
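A quick way to see what these conversions produce is to run them on a few made-up codes (the vector `icd` below is an assumption for illustration, standing in for healthdemo$ICDCODE):

```r
# Made-up ICD-10 style codes, for illustration only
icd <- c("I21.4", "I51.9", "J45.0", "I63")

# First character is the chapter letter
icd_char <- substr(icd, 1, 1)

# Everything after the letter, converted to a number (decimals kept)
icd_num <- as.numeric(substr(icd, 2, nchar(icd)))

# Same substring converted with as.integer(), which drops the decimals
icd_int <- as.integer(substr(icd, 2, nchar(icd)))
```

Whether to keep the decimals depends on how the ranges are defined; with as.numeric, a code such as I51.9 still satisfies a condition like `icd_num < 52`.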
Then your first approach should work:
healthdemo$cvd[healthdemo$icd_char == 'I' &
01 <= healthdemo$icd_num &
healthdemo$icd_num < 52 ] <- 1
In R, the behavior of SAS's IF ... THEN is achieved not with if(...){...} but with ifelse(..., ..., ...). You also cannot use the form a < var < b; write a <= var & var < b (two separate comparisons joined with &) instead. Furthermore, you have not quite gotten the functional paradigm of R programming.
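The point about range tests can be checked on a small made-up vector (the values below are assumptions for illustration). The sketch writes the range as two comparisons and uses parse() inside try() to show that the chained form is rejected by R's parser:

```r
# Made-up numeric parts of ICD codes
icd_num <- c(0.5, 1, 21.4, 51.9, 52)

# A range test is written as two comparisons joined with the vectorized &
in_range <- icd_num >= 1 & icd_num < 52

# The SAS-style chained form does not parse in R; try() captures the error
chained <- try(parse(text = "1 <= icd_num < 52"), silent = TRUE)
chained_fails <- inherits(chained, "try-error")
```

Note the use of `&` rather than `&&`: the single form compares element by element, which is what is needed for whole columns.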
Try this instead of your last statement:
healthdemo$cvd <- NA # initialize to missing
healthdemo$cvd <- ifelse (healthdemo$icd_char == "I" &
01 <= healthdemo$icd_num &
healthdemo$icd_num < 52 , 1, healthdemo$cvd )
Note that the form var <- ifelse(logicalvec, value, var) allows you to do selective replacements. The old value is the default, and only the positions where the logical vector is TRUE are changed.
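Here is a small sketch of that selective-replacement pattern on made-up data (the data frame `df` and the second rule for "J" codes are assumptions added for illustration). It shows that a later ifelse() pass leaves earlier assignments untouched:

```r
# Made-up data standing in for healthdemo
df <- data.frame(
  icd_char = c("I", "I", "J", "I"),
  icd_num  = c(21.4, 63, 45, 51.9),
  stringsAsFactors = FALSE
)

df$cvd <- NA  # initialize to missing

# First pass: flag "I" codes in the range [1, 52)
in_range <- df$icd_char == "I" & df$icd_num >= 1 & df$icd_num < 52
df$cvd <- ifelse(in_range, 1, df$cvd)

# Second pass (hypothetical rule): set "J" codes to 0, keep everything else
df$cvd <- ifelse(df$icd_char == "J", 0, df$cvd)
```

Rows that match neither rule stay NA, which makes unclassified records easy to find afterwards with is.na(df$cvd).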
Robert Muenchen has written a book entitled something along the lines of 'R for SAS and SPSS Users'. There's also a freely available draft version that is about 70 pages long and should show up with a web search.