In R, find the column that contains a string for each row

Question

I must be using the wrong search terms, because I cannot believe my question is unique, but I found only one similar question.

I have some rather clunky data from the World Bank in the form of a flat file representing a database. The data are one project per row, but each project has multiple characteristics stored in columns with names like "SECTOR.1", each with its own attributes in further columns with names like "SECTOR.1.PCT", etc.

From this, I'm trying to extract the data that are related to particular kinds of SECTOR, but I still need to have all of the other project information.

I have been able to make some steps in the right direction, from another question I found on SO: Find the index of the column in data frame that contains the string as value

A minimal reproducible example, based on the question notes above, is here:

> df <- data.frame(col1 = c(letters[1:4],"c"), 
...                  col2 = 1:5, 
...                  col3 = c("a","c","l","c","l"), 
...                  col4= letters[3:7])
> df
  col1 col2 col3 col4
1    a    1    a    c
2    b    2    c    d
3    c    3    l    e
4    d    4    c    f
5    c    5    l    g

The output that I want would be something like:

1 col4
2 col3
3 col1
4 col3
5 col1

I know that I could use nested ifelse calls, but that does not seem like a very elegant approach. Admittedly, since I will do this just once (for this project), the risk of typos is small. For example,

> df$hasc <- ifelse(grepl("c",df$col1), "col1",
...                         ifelse(grepl("c",df$col2), "col2",
...                                ifelse(grepl("c",df$col3), "col3",
...                                       ifelse(grepl("c",df$col4), "col4",
...                                              NA))))
> df
  col1 col2 col3 col4 hasc
1    a    1    a    c col4
2    b    2    c    d col3
3    c    3    l    e col1
4    d    4    c    f col3
5    c    5    l    g col1

I think it would be better if I had some kind of apply function that looked at the columns row by row. The method in the previous question does not work here because I need to know which column has the "c" in each row. Applying it by column gives something that doesn't make sense to me, except that the column names with "c" are listed. I don't understand the 1, 3, 4, because they do not correspond to the row names or the counts:

> which(apply(df, 2, function(x) any(grepl("c", x))))
col1 col3 col4
   1    3    4

And, if I attempt to do it by row, I do see that each row has a "c", as expected.

> which(apply(df, 1, function(x) any(grepl("c", x))))
[1] 1 2 3 4 5

Also, I wonder whether there is a way to handle this that would not break if "c" appeared in more than one column of a row, for example, if we had:

> df <- data.frame(col1 = c(letters[1:4],"c"), 
...                  col2 = 1:5, 
...                  col3 = c("a","c","l","c","c"), 
...                  col4= letters[3:7])
> df
  col1 col2 col3 col4
1    a    1    a    c
2    b    2    c    d
3    c    3    l    e
4    d    4    c    f
5    c    5    c    g

My ifelse approach then fails because it only gives 'col1' for row 5.

Answer 1:

Assuming that there is a single 'c' in each row of the dataset 'df', we can use max.col to get the column index where the row element is 'c', and use that to get the matching column names.

df$hasc <- colnames(df)[max.col(df=='c')]
df
#  col1 col2 col3 col4 hasc
#1    a    1    a    c col4
#2    b    2    c    d col3
#3    c    3    l    e col1
#4    d    4    c    f col3
#5    c    5    l    g col1
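
One caveat with max.col is that its default ties.method is "random", so a row with several matches (or none, where every column ties at FALSE) gets an arbitrary column. The following is only a sketch of one way to make that deterministic, using the question's second data frame, where row 5 has two matches. The names hit, first_hit and hasc_first are made up for illustration.

```r
# The multi-match data frame from the question (row 5 has "c" in col1 and col3).
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# Logical matrix: TRUE where a cell is exactly "c".
hit <- df == "c"

# ties.method = "first" always picks the leftmost matching column,
# instead of a random one among the tied columns.
first_hit <- max.col(hit, ties.method = "first")

# A row with no match is all FALSE, so every column ties; mask such rows.
first_hit[rowSums(hit) == 0] <- NA

hasc_first <- colnames(df)[first_hit]
hasc_first
# [1] "col4" "col3" "col1" "col3" "col1"
```

This still reports only one column per row, so it suits the case where the leftmost match is the one that matters.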

If there is more than one 'c' per row, one option is to loop through the rows and paste the matching column names together:

df$hasc <- apply(df=='c', 1, FUN= function(x) toString(names(x)[x]))
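
As a quick check, here is what that line should produce on the question's second data frame, where row 5 has two matches. The variable name hasc_all is only for illustration, and unname() is used to drop the row names that apply carries over.

```r
# The multi-match data frame from the question.
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# For each row of the logical matrix, keep the names of the TRUE cells
# and collapse them into one comma-separated string.
hasc_all <- apply(df == "c", 1, function(x) toString(names(x)[x]))

unname(hasc_all)
# [1] "col4"       "col3"       "col1"       "col3"       "col1, col3"
```

A row with no match at all would come out as an empty string rather than NA, since toString() of an empty character vector is "".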

Answer 2:

An alternative for the multiple matches case, which might be a bit quicker than running apply:

tmp <- which(df=="c", arr.ind=TRUE)
cnt <- ave(tmp[,"row"], tmp[,"row"], FUN=seq_along)
maxnames <- paste0("max",sequence(max(cnt)))
df[maxnames] <- NA
df[maxnames][cbind(tmp[,"row"],cnt)] <- names(df)[tmp[,"col"]]

#  col1 col2 col3 col4 max1 max2
#1    a    1    a    c col4 <NA>
#2    b    2    c    d col3 <NA>
#3    c    3    l    e col1 <NA>
#4    d    4    c    f col3 <NA>
#5    c    5    c    g col1 col3
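
The comparison df=="c" only finds cells that are exactly "c". If a cell merely needs to contain the string, as with grepl in the question, the same which(..., arr.ind=TRUE) indexing can be applied to a logical matrix built with grepl. This is a minimal sketch under that assumption, returning one (row, column) pair per match in long form. The names hit, pos and matches are made up for illustration, and the data frame is assumed to have more than one row so that sapply returns a matrix.

```r
# The multi-match data frame from the question.
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# One logical column per data column: TRUE where the cell contains "c".
# fixed = TRUE treats the pattern as a literal string, not a regex.
hit <- sapply(df, function(col) grepl("c", col, fixed = TRUE))

# Row and column index of every TRUE cell.
pos <- which(hit, arr.ind = TRUE)

# Long format: one line per match, sorted by row.
matches <- data.frame(row = pos[, "row"],
                      column = colnames(hit)[pos[, "col"]],
                      stringsAsFactors = FALSE)
matches <- matches[order(matches$row), ]
matches$column
# [1] "col4" "col3" "col1" "col3" "col1" "col3"
```

The long form avoids having to decide in advance how many max columns are needed, and it can be merged back onto the project rows by the row index.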

Source: https://stackoverflow.com/questions/32217562/in-r-find-the-column-that-contains-a-string-in-for-each-row

Tags

grepl