Question
I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA
Separately, I have a vector (risk_codes) of length 136, all strings. These strings are risk codes, and each may match the leading characters of a diagnosis code (e.g. J12 would be OK, F4 would be OK, H798 would not).
I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.
So far, I've tried the following with the most success over other attempts:
for (i in 1:length(risk_codes)) {
df$newcol <- apply(df,1,function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}
It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.
Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.
The desired result would look like this:
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20 newcol
data data data data J123 F456 H789 E468 1
data data data data T452 NA NA NA 0
if my risk_codes contained J12, F4, T543, for example.
Answer 1:
We want to apply grepl with all the risk_codes at once, so that we get one result per row. We can do that with sapply and any.
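To see what these two functions do together, here is a minimal sketch on a single row. The vector diag_row and its values are made up for illustration and are not part of the question's data.

```r
# One row of diagnosis codes (illustrative values only)
diag_row <- c("J123", "F456", "H789", NA)
risk_codes <- c("F456", "XXX")

# sapply calls grepl once per risk code; each call returns one logical
# per diagnosis code, so the result is a matrix with one column per risk code
hits <- sapply(risk_codes, function(code) grepl(code, diag_row))

# any() collapses the whole matrix into a single TRUE/FALSE for the row
row_flag <- any(hits)
```

Note that grepl treats NA entries as non-matches, so empty diagnosis slots do not need special handling.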
So, we can drop the for loop and your code becomes like this:
my_df <- read.table(text="Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header=TRUE)
risk_codes <- c("F456", "XXX") # test codes
# For each row: TRUE if any risk code matches any diagnosis column (5 to 24)
my_df$newcol <- apply(my_df, 1, function(x)
  any(sapply(risk_codes,
             function(codes) grepl(codes, x[c(5:24)]))))
The result is a logical vector.
If you still want to use 1 and 0 instead of the TRUE/FALSE, you just need to finish with:
my_df$newcol <- ifelse(my_df$newcol, 1, 0)
The result will be:
> my_df
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data J123 F456 H789 E468 1
2 data data data data T452 <NA> <NA> <NA> 0
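Since the question asks for a match only on the leading characters of a code, and apply over rows can be slow on a large data frame, a column-wise variant may be worth trying. The following is a sketch rather than a tested drop-in: it assumes the diagnosis columns all have names starting with "Diag_" and that the risk codes contain no regular-expression metacharacters. It uses the example risk codes from the question.

```r
# Sample data from the answer, with the question's example risk codes
my_df <- read.table(text = "Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header = TRUE, stringsAsFactors = FALSE)
risk_codes <- c("J12", "F4", "T543")

# One anchored pattern, "^(J12|F4|T543)", so a risk code only matches
# at the start of a diagnosis code
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Pick the diagnosis columns by name (assumption: they all start with "Diag_")
diag_cols <- grep("^Diag_", names(my_df))

# grepl is vectorised over a whole column; NA entries count as no match
col_hits <- lapply(my_df[diag_cols], function(col) grepl(pattern, col))

# OR the per-column results together: TRUE where at least one column matched
my_df$newcol <- as.integer(Reduce(`|`, col_hits))
```

This runs grepl once per diagnosis column instead of once per row and risk code, which is where most of the saving would come from.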
Source: https://stackoverflow.com/questions/35936658/r-return-boolean-if-any-strings-in-a-vector-appear-in-any-of-several-columns