Determining if one value occurs once in a row of columns, but a second value doesn't occur at all

╄→гoц情女王★ submitted on 2019-12-05 13:08:08

Here's a solution without tidyverse, and without *apply functions apart from the type conversion. First, let's convert those four columns to integers:

cols <- 3:6
df1[cols] <- lapply(df1[cols], as.integer)

Then

df <- df1[rowSums(df1[cols]) == (3 + length(cols) - 1) & rowSums(df1[cols] == 3) == 1, ]
df$Group <- names(df)[cols][which(t(df[cols]) == 3, arr.ind = TRUE)[, 1]]
df
# A tibble: 2 x 7
#   seq                       loc    Ball   Cat Square Water Group 
#   <fct>                     <fct> <int> <int>  <int> <int> <chr> 
# 1 AAAAAACCAGTCCCAGTTCGGATTG t         3     1      1     1 Ball  
# 2 AAAAAACCGGTCACAGTTCAGATTG t         1     1      3     1 Square

In the first line I select the right rows with two conditions: there has to be exactly one element equal to 3 in those cols columns (rowSums(df1[cols] == 3) == 1), and the total sum of the row has to be 3 + length(cols) - 1. With a single 3 in the row, that sum is only reachable if every other value is 1, which rules out any 2. Then in the second line I check which columns have 3 and pick the corresponding names of df as values for Group.
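To see why both conditions are needed, here is a minimal sketch on made-up data. The data frame toy and its values are assumptions for illustration only (the real df1 is not shown in the thread); row s2 has exactly one 3 but also a 2, so only the sum condition rejects it.

```r
# Made-up stand-in for df1; the values are chosen to exercise each condition
toy <- data.frame(
  seq    = c("s1", "s2", "s3", "s4"),
  loc    = "t",
  Ball   = c(3L, 1L, 3L, 1L),
  Cat    = c(1L, 2L, 3L, 1L),
  Square = c(1L, 3L, 1L, 3L),
  Water  = c(1L, 1L, 1L, 1L)
)
cols <- 3:6

# Condition 1: exactly one 3 among the value columns
exactly_one_3 <- rowSums(toy[cols] == 3) == 1
# Condition 2: row total equals one 3 plus (length(cols) - 1) ones
rest_are_1 <- rowSums(toy[cols]) == 3 + length(cols) - 1

res <- toy[exactly_one_3 & rest_are_1, ]
# Name of the column holding the 3, one per kept row
res$Group <- names(res)[cols][which(t(res[cols]) == 3, arr.ind = TRUE)[, 1]]
res
```

Rows s1 and s4 survive; s2 fails the sum test and s3 has two 3s.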

I often use the basic apply when doing row-wise calculations. You could do something with the actual dplyr::rowwise if you wanted a tidyverse solution. Here's a version using only base R:

filter_on = apply(X = df1[3:6], 
                  MARGIN = 1, 
                  FUN = function(x){sum(x == 3) == 1 & sum(x == 1) == 3})
df1 = df1[filter_on,]

columns = colnames(df1)[3:6]

df1$Group = unlist(apply(X = df1[3:6], 
                         MARGIN = 1,
                         FUN = function(x){columns[x == 3]}))
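For the tidyverse route mentioned above, a rough sketch with dplyr::rowwise() and c_across() might look like the following. It assumes dplyr 1.0 or later; toy and value_cols are made-up names and data for illustration, not part of the original question.

```r
library(dplyr)

# Made-up stand-in for df1
toy <- data.frame(
  seq    = c("s1", "s2", "s3", "s4"),
  loc    = "t",
  Ball   = c(3L, 1L, 3L, 1L),
  Cat    = c(1L, 2L, 3L, 1L),
  Square = c(1L, 3L, 1L, 3L),
  Water  = c(1L, 1L, 1L, 1L)
)
value_cols <- c("Ball", "Cat", "Square", "Water")

res <- toy %>%
  rowwise() %>%
  # keep rows with exactly one 3 and three 1s, evaluated one row at a time
  filter(sum(c_across(all_of(value_cols)) == 3) == 1,
         sum(c_across(all_of(value_cols)) == 1) == 3) %>%
  # the filter guarantees a single 3, so this picks exactly one name per row
  mutate(Group = value_cols[c_across(all_of(value_cols)) == 3]) %>%
  ungroup()
res
```

The logic is the same as in the apply version; rowwise() just makes each row its own group so the summaries run per row.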

Just to show an alternative, here we work with the data in long format instead of row-wise, using data.table functions:

library(data.table)
d <- melt(setDT(df1), id.vars = c("seq", "loc"))
d[d[ , .I[sum(value == 3) == 1 & !any(value == 2)], by = .(seq, loc)]$V1][value == 3]
#                          seq loc variable value
# 1: AAAAAACCAGTCCCAGTTCGGATTG   t     Ball     3
# 2: AAAAAACCGGTCACAGTTCAGATTG   t   Square     3

melt data to long format using 'seq' and 'loc' as id variables. If the combination of 'seq' and 'loc' is not a unique identifier of rows, create a unique row index (e.g. ri := 1:.N).

For each 'seq' and 'loc' (by = .(seq, loc); i.e. for each row in the original data), create a logical value for the desired condition: one 3 and no 2 per row (sum(value == 3) == 1 & !any(value == 2)). Grab the corresponding row indexes (.I). The indexes, auto-named 'V1', are then used to subset 'd'.

Finally, select rows where 'value' equals 3 ([value == 3]).
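When 'seq' and 'loc' do not identify rows uniquely, the row-index variant suggested above could be sketched as follows. The table toy is made-up data in which two rows share the same seq and loc; ri is the index name from the suggestion, and long and keep are illustrative names.

```r
library(data.table)

# Made-up data: the first two rows share the same seq and loc
toy <- data.table(
  seq    = c("s1", "s1", "s2"),
  loc    = "t",
  Ball   = c(3L, 1L, 3L),
  Cat    = c(1L, 2L, 3L),
  Square = c(1L, 3L, 1L),
  Water  = c(1L, 1L, 1L)
)

toy[, ri := 1:.N]                                   # unique row index
long <- melt(toy, id.vars = c("ri", "seq", "loc"))  # one line per original cell

# Row numbers of 'long' belonging to original rows with one 3 and no 2
keep <- long[, .I[sum(value == 3) == 1 & !any(value == 2)], by = ri]$V1

res <- long[keep][value == 3]                       # the cell that holds the 3
res
```

Grouping by ri instead of by seq and loc keeps the two s1 rows apart, so each one is tested on its own.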

Here is one more version. It only covers the row selection.

library(dplyr)  # for %>% and mutate_if

# create vector of wanted column names
cols <- c("Ball", "Cat", "Square", "Water")
# make values numeric
df1[, cols] <- df1[, cols] %>% mutate_if(is.character, as.numeric)

#filter rows
df1[which((rowSums(df1[, cols]) == (length(cols)+2) ) & (rowSums(df1[, cols] == 2) == 0)),]

                        seq loc Ball Cat Square Water
1 AAAAAACCAGTCCCAGTTCGGATTG   t    3   1      1     1
5 AAAAAACCGGTCACAGTTCAGATTG   t    1   1      3     1
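The sum-based tests rely on the values being restricted to 1, 2 and 3, which is an assumption read off the conditions in this thread rather than something stated outright. Under that assumption, a quick way to convince yourself is to enumerate every possible row and compare both shortcuts against the direct rule (exactly one 3, no 2). The names grid, direct, sum_no_2 and sum_one_3 are illustrative.

```r
# Every combination of 1, 2, 3 across four columns (3^4 = 81 rows)
grid <- expand.grid(Ball = 1:3, Cat = 1:3, Square = 1:3, Water = 1:3)

# Direct statement of the rule: exactly one 3 and no 2
direct <- rowSums(grid == 3) == 1 & rowSums(grid == 2) == 0

# Shortcut from this answer: total is ncol + 2 and there is no 2
sum_no_2 <- rowSums(grid) == ncol(grid) + 2 & rowSums(grid == 2) == 0

# Shortcut from the first answer: total is 3 + ncol - 1 and there is one 3
sum_one_3 <- rowSums(grid) == 3 + ncol(grid) - 1 & rowSums(grid == 3) == 1

all(direct == sum_no_2)   # TRUE
all(direct == sum_one_3)  # TRUE
sum(direct)               # 4: the 3 can sit in any one of the four columns
```

If other values (0, 4, NA, ...) can occur in the data, the sum shortcuts no longer follow from the rule and the direct test is the safer choice.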

Looks like the apply version is the fastest of the three versions benchmarked here (going by the median), but not by much.

microbenchmark::microbenchmark(
which = df1[which((rowSums(df1[, cols]) == (length(cols)+2) ) & (rowSums(df1[, cols] == 2) == 0)),],
filter = df1[rowSums(df1[cols]) == (3 + length(cols) - 1) & rowSums(df1[cols] == 3) == 1, ],
apply = df1[apply(X = df1[3:6], 
          MARGIN = 1, 
          FUN = function(x){sum(x == 3) == 1 & sum(x == 1) == 3}),]
)

Unit: microseconds
   expr     min       lq     mean  median       uq      max neval cld
  which 429.043 436.4665 446.2817 445.811 451.3140  493.553   100   a
 filter 429.555 435.5715 447.8151 440.307 449.2670  724.202   100   a
  apply 339.958 346.9975 435.0437 351.222 362.2295 8141.819   100   a

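My solution is a takeoff on @Julius Vainora's. Mine is more convoluted, but I used match() and added an index column.

DF$index <- seq.int(nrow(DF))
col_names <- names(DF)[3:ncol(DF)]

# For each row, look up the column position of its 3 and map it to a column name
# ('cols' is the column selection from the earlier answers, e.g. cols <- 3:6)
DF$Group <- col_names[which(DF[cols] == 3, arr.ind = TRUE)[, 2][
  DF$index[match(
    DF$index, which(
      DF[cols] == 3, arr.ind = TRUE)[, 1])]]]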
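The match() idea can be unpacked step by step. The sketch below uses made-up data and the illustrative names hits and pos; it is a simplified reading of the approach, not the poster's exact code. Since the index column is just 1, 2, ..., n, seq_len(nrow(DF)) stands in for it here.

```r
# Made-up data: row 2 contains no 3 at all
DF <- data.frame(
  seq    = c("s1", "s2", "s3"),
  loc    = "t",
  Ball   = c(3L, 1L, 1L),
  Cat    = c(1L, 1L, 1L),
  Square = c(1L, 1L, 3L),
  Water  = c(1L, 2L, 1L)
)
cols <- 3:6

# One line per 3 found: column 1 is the row position, column 2 the column position
hits <- which(DF[cols] == 3, arr.ind = TRUE)

# For every row of DF, the position of its first hit (NA if the row has no 3)
pos <- match(seq_len(nrow(DF)), hits[, 1])

# Translate the column position into a column name
DF$Group <- names(DF)[cols][hits[, 2][pos]]
DF
```

Unlike the filtering answers, this keeps every row and leaves Group as NA where no 3 occurs. A row with several 3s gets the name of the first hit in the order which() returns them, so it would still need the row filter to match the original requirement.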