Question
Probably a terrible title, but I have a table of qualifiers stored as "1", "2", and "3". What I'm trying to do is look in each row (approximately 300,000 rows, but variable) and determine where a single "3" occurs (if it occurs more than once, I am not interested in it) while the rest of the columns in that row have a "1", and return that to a list. (The number of columns and the column names change based on the input files.)
Instinctively I want to attempt this with nested for loops that index the row count and then the column count, followed by some function that looks for one "3" and no "2"s, which likely means the preferred way would be some apply function, correct?
Another thought was to total the number of columns, add 2, and then sum the row while having a qualifier that no 2's can be in the row. But that seemed pretty complicated.
df1
seq loc Ball Cat Square Water
1 AAAAAACCAGTCCCAGTTCGGATTG t 3 1 1 1
2 AAAAAACCAGTCTCAGTTCGGATTG b 1 1 3 3
3 AAAAAACCAGTCTCAGTTCGGATTG t 1 3 2 1
4 AAAAAACCGGTCACAGTTCAGATTG b 1 1 1 2
5 AAAAAACCGGTCACAGTTCAGATTG t 1 1 3 1
Expected Output:
seq loc Group
1 AAAAAACCAGTCCCAGTTCGGATTG t Ball
2 AAAAAACCGGTCACAGTTCAGATTG t Square
dput of df1:
structure(list(seq = structure(c(1L, 2L, 2L, 3L, 3L), .Label =
c("AAAAAACCAGTCCCAGTTCGGATTG",
"AAAAAACCAGTCTCAGTTCGGATTG", "AAAAAACCGGTCACAGTTCAGATTG"), class =
"factor"),
loc = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("b",
"t"), class = "factor"), Ball = c("3", "1", "1", "1", "1"
), Cat = c("1", "1", "3", "1", "1"), Square = c("1", "3",
"2", "1", "3"), Water = c("1", "3", "1", "2", "1")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Answer 1:
Here's a solution without tidyverse and without any row-wise *apply calls. First, let's convert those four columns to integers:
cols <- 3:6
df1[cols] <- lapply(df1[cols], as.integer)
Then
df <- df1[rowSums(df1[cols]) == (3 + length(cols) - 1) & rowSums(df1[cols] == 3) == 1, ]
df$Group <- names(df)[cols][which(t(df[cols]) == 3, arr.ind = TRUE)[, 1]]
df
# A tibble: 2 x 7
# seq loc Ball Cat Square Water Group
# <fct> <fct> <int> <int> <int> <int> <chr>
# 1 AAAAAACCAGTCCCAGTTCGGATTG t 3 1 1 1 Ball
# 2 AAAAAACCGGTCACAGTTCAGATTG t 1 1 3 1 Square
In the first line I select the right rows with two conditions: there has to be exactly one element equal to 3 in those cols columns (rowSums(df1[cols] == 3) == 1) and the total sum of the row has to be 3 + length(cols) - 1. Then in the second line I check which columns have 3 and pick the corresponding names of df as values for Group.
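To see why both conditions are needed, here is a minimal base R sketch on a hypothetical toy data frame (toy, its id column, and its values are made up for illustration and are not part of the original data). Row b reaches the target sum without containing a 3, and row c contains a single 3 but also a 2, so each condition on its own would let one wrong row through.

```r
# Hypothetical toy data: qualifiers coded 1, 2, 3
toy <- data.frame(
  id     = c("a", "b", "c", "d"),
  Ball   = c(3L, 2L, 3L, 1L),
  Cat    = c(1L, 2L, 2L, 1L),
  Square = c(1L, 1L, 1L, 3L),
  Water  = c(1L, 1L, 1L, 1L)
)
cols <- 2:5

# Condition 1: exactly one 3 in the qualifier columns
one_three <- rowSums(toy[cols] == 3) == 1
# Condition 2: row total equals one 3 plus a 1 in every other column
sum_ok <- rowSums(toy[cols]) == 3 + length(cols) - 1

kept <- toy[one_three & sum_ok, ]
# Each kept row has exactly one 3, so which() returns one position per row
kept$Group <- names(toy)[cols][apply(kept[cols] == 3, 1, which)]
kept
```

Only rows a and d should survive, labelled with the column that holds their single 3.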
Answer 2:
I often use the basic apply when doing rowwise calculations. You could do something with the actual dplyr::rowwise if you wanted a tidyverse solution. Here's just using base R:
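# keep rows with exactly one 3 and a 1 in every other column
filter_on = apply(X = df1[3:6],
                  MARGIN = 1,
                  FUN = function(x){sum(x == 3) == 1 && sum(x == 1) == length(x) - 1})
df1 = df1[filter_on,]
columns = colnames(df1)[3:6]
# name of the column that holds the single 3 in each remaining row
df1$Group = unlist(apply(X = df1[3:6],
                         MARGIN = 1,
                         FUN = function(x){columns[x == 3]}))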
Answer 3:
Just to show an alternative where we work with data in long format instead of row-wise. Here, using data.table functions:
library(data.table)
d <- melt(setDT(df1), id.vars = c("seq", "loc"))
d[d[ , .I[sum(value == 3) == 1 & !any(value == 2)], by = .(seq, loc)]$V1][value == 3]
# seq loc variable value
# 1: AAAAAACCAGTCCCAGTTCGGATTG t Ball 3
# 2: AAAAAACCGGTCACAGTTCAGATTG t Square 3
First, melt the data to long format using 'seq' and 'loc' as id variables. If the combination of 'seq' and 'loc' does not uniquely identify rows, create a unique row index (e.g. ri := 1:.N).
For each 'seq' and 'loc' (by = .(seq, loc); i.e. for each row in the original data), create a logical vector for the desired condition: one 3 and no 2 per row (sum(value == 3) == 1 & !any(value == 2)). Grab the corresponding row indexes (.I). The indexes, auto-named 'V1', are then used to subset 'd'.
Finally, select rows where 'value' equals 3 ([value == 3]).
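The row-index suggestion above can be sketched as follows. This is only an illustration on hypothetical toy data (dt, its values, and the helper names long, keep, by_id, and res are made up here), assuming data.table is installed. The first two rows share the same seq and loc, so grouping by those two columns would pool them into one group, while grouping by ri treats each original row separately.

```r
library(data.table)

# Hypothetical toy data in which (seq, loc) does not identify rows uniquely
dt <- data.table(
  seq    = c("AAA", "AAA", "CCC"),
  loc    = c("t", "t", "b"),
  Ball   = c(3L, 1L, 1L),
  Cat    = c(1L, 3L, 2L),
  Square = c(1L, 1L, 3L)
)

dt[, ri := seq_len(.N)]  # unique row index
long <- melt(dt, id.vars = c("ri", "seq", "loc"))

# Grouping by ri: one group per original row
keep <- long[, .I[sum(value == 3) == 1 & !any(value == 2)], by = ri]$V1
res <- long[keep][value == 3]

# Grouping by (seq, loc) pools rows 1 and 2, so their two 3s disqualify both
by_id <- long[, .I[sum(value == 3) == 1 & !any(value == 2)], by = .(seq, loc)]$V1

res
```

Here res should keep rows 1 and 2 (with Ball and Cat as the variable), whereas by_id comes back empty.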
Answer 4:
Putting in an extra version. This only covers the row selection.
library(dplyr)  # provides %>% and mutate_if
# create vector of wanted column names
cols <- c("Ball", "Cat", "Square", "Water")
# make values numeric
df1[, cols] <- df1[, cols] %>% mutate_if(is.character, as.numeric)
# filter rows: row sum equals number of columns + 2, and no 2s in the row
df1[which((rowSums(df1[, cols]) == (length(cols)+2) ) & (rowSums(df1[, cols] == 2) == 0)),]
seq loc Ball Cat Square Water
1 AAAAAACCAGTCCCAGTTCGGATTG t 3 1 1 1
5 AAAAAACCGGTCACAGTTCAGATTG t 1 1 3 1
Looks like the apply version is the fastest of the three approaches benchmarked here, but not by much.
microbenchmark::microbenchmark(
which = df1[which((rowSums(df1[, cols]) == (length(cols)+2) ) & (rowSums(df1[, cols] == 2) == 0)),],
filter = df1[rowSums(df1[cols]) == (3 + length(cols) - 1) & rowSums(df1[cols] == 3) == 1, ],
apply = df1[apply(X = df1[3:6],
MARGIN = 1,
FUN = function(x){sum(x == 3) == 1 & sum(x == 1) == 3}),]
)
Unit: microseconds
expr min lq mean median uq max neval cld
which 429.043 436.4665 446.2817 445.811 451.3140 493.553 100 a
filter 429.555 435.5715 447.8151 440.307 449.2670 724.202 100 a
apply 339.958 346.9975 435.0437 351.222 362.2295 8141.819 100 a
Answer 5:
My solution is a takeoff on @Julius Vainora's. Mine is more convoluted, but I used match() and added an index column.
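cols <- 3:ncol(DF)                            # qualifier columns, taken before the index column is added
col_names <- names(DF)[cols]
DF$index <- seq.int(nrow(DF))
hits <- which(DF[cols] == 3, arr.ind = TRUE)  # matrix of (row, col) positions of every 3
# look up the first hit for each row; rows without a 3 get NA
DF$Group <- col_names[hits[, 2][match(DF$index, hits[, 1])]]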
Source: https://stackoverflow.com/questions/53841449/determining-if-one-value-occurs-once-in-a-row-of-columns-but-a-second-value-doe