Question
I have a rather large tibble (called df.tbl, with ~26k rows and 22 columns) and I want to find the "twins" of each object, i.e. every row that has the same values in columns 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i,], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything works as expected and returns a 2 x 22 tibble. I can expand this using:
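library(dplyr)
library(purrr)

x <- vector("list", nrow(df.tbl))  # one slot per row
for (i in 1:nrow(df.tbl)) {
  # join the full table against row i and keep the matching row numbers
  x[[i]] <- as_vector(inner_join(df.tbl[,],
                                 df.tbl[i,],
                                 by = c("date",
                                        "forge",
                                        "serNum",
                                        "PinMain",
                                        "PinMainNumber",
                                        "Pos")) %>%
                        select(rowNum.x))
}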
to create a list containing the row numbers for each twin for each object (row).
However I try, I cannot get map() to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
                                  .,
                                  by = c("date",
                                         "forge",
                                         "serNum",
                                         "PinMain",
                                         "PinMainNumber",
                                         "Pos")) %>%
               select(rowNum.x) )
All I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about converting the for loop into an equivalent using map()?
My original data look like this:
> head(df.tbl, 3)
# A tibble: 3 x 22
rowNum date forge serNum PinMain PinMainNumber Pos FrontBack flow Sharped SV OP max min mean
<dbl> <date> <chr> <fct> <fct> <fct> <fct> <fct> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl>
1 1 2017-10-18 NA 179 Pin 1 W F NA 3 36237 235 77.7 55.3 64.7
2 2 2017-10-18 NA 179 Pin 2 W F NA 3 36237 235 77.5 52.1 67.4
3 3 2017-10-18 NA 179 Pin 3 W F NA 3 36237 235 79.5 58.6 69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
# fileName <chr>
and I would like a list with a length the same as nrow(df.tbl) looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin / duplicate (as above), but a few have two or even three duplicates (as defined above, i.e. columns 2:7 are the same).
Answer 1:
A bit late to the party, but you can do it much more neatly with nest().
tbl.df1 <- df.tbl %>%
  group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>%
  nest(rowNum)
The twins will be in the list of tibbles created by nest().
tbl.df1$data
# [[1]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 1
# 2 7
# [[2]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 2
# 2 8
# etc
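Note that the nest() result above has one entry per group of twins, whereas the question asks for one entry per row. The following is a minimal sketch of both shapes and is not part of the original answer. It uses a small made-up tibble called toy (hypothetical values, only three key columns) in place of df.tbl. It also selects the needed columns first and calls nest() without arguments, on the assumption that this form behaves the same before and after the nest() interface change in tidyr 1.0.0.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df.tbl: rows 1 and 4 are twins, as are rows 2 and 5.
toy <- tibble(
  rowNum = 1:6,
  date   = as.Date("2017-10-18"),
  serNum = c("179", "179", "179", "179", "179", "180"),
  Pos    = c("1", "2", "3", "1", "2", "1")
)

# Shape 1: one entry per group of identical key columns.
# Only rowNum and the key columns are kept, so nest() packs rowNum into `data`.
per_group <- toy %>%
  select(rowNum, date, serNum, Pos) %>%
  group_by(date, serNum, Pos) %>%
  nest()

# Shape 2: one entry per row, holding the row numbers of that row's whole group
# (the row itself included), in the original row order.
per_row <- toy %>%
  group_by(date, serNum, Pos) %>%
  mutate(twins = list(rowNum)) %>%
  ungroup() %>%
  pull(twins)
```

Here per_row has the same length as nrow(toy), which matches the list shape requested in the question; rows without a twin simply get their own row number.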
Answer 2:
Do you really need to solve it with map()? I would solve it by combining duplicated() and semi_join() from the dplyr package, like this:
library(dplyr)  # for %>%, group_by_at(), arrange(), summarise(), pull()

defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")
dplyr::semi_join(
  df.tbl,
  df.tbl[duplicated(df.tbl[defining_columns]),],
  by = defining_columns
) %>%
  group_by_at(defining_columns) %>%
  arrange(.by_group = TRUE) %>%
  summarise(twins = paste0(rowNum, collapse = ",")) %>%
  pull(twins) %>%
  strsplit(",")
duplicated() flags the rows that repeat an earlier row, and semi_join() keeps only the rows in x that have a match in y.
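To see why both steps are needed, here is a small sketch on made-up data (the tibble toy and its values are hypothetical, with two key columns instead of six). duplicated() marks only the second and later occurrences of a key combination, so on its own it would miss the first member of each twin pair; the semi_join() against those flagged rows brings the first occurrences back.

```r
library(dplyr)

# Hypothetical data: rows 1 and 4 share the same key columns.
toy <- tibble(
  rowNum = 1:5,
  serNum = c("179", "179", "180", "179", "181"),
  Pos    = c("1", "2", "1", "1", "1")
)
keys <- c("serNum", "Pos")

# TRUE only where a row repeats an earlier row on the key columns.
dup_flags <- duplicated(toy[keys])

# Keep every row of toy whose key combination occurs among the flagged rows.
with_twins <- semi_join(toy, toy[dup_flags, ], by = keys)
```

In this toy example only row 4 is flagged, while with_twins contains both row 1 and row 4.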
Hope this helps!!
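If a map() version is still wanted, the error in the question comes from what map() iterates over: a tibble is a list of columns, so map(df.tbl, ...) passes each column (first the numeric rowNum) to inner_join(), which then fails because it receives a vector rather than a table. A minimal sketch of a direct translation of the for loop iterates over row indices instead. The tibble toy and the name keys are made up for illustration and stand in for df.tbl and its six key columns.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for df.tbl: rows 1 and 4 are twins, as are rows 2 and 5.
toy <- tibble(
  rowNum = 1:6,
  date   = as.Date("2017-10-18"),
  serNum = c("179", "179", "179", "179", "179", "180"),
  Pos    = c("1", "2", "3", "1", "2", "1")
)
keys <- c("date", "serNum", "Pos")

# Iterate over row indices, not over the tibble itself.
twins <- map(seq_len(nrow(toy)), function(i) {
  inner_join(toy, toy[i, ], by = keys) %>%
    pull(rowNum.x)  # row numbers of all rows matching row i, itself included
})
```

This still performs one join per row, so on ~26k rows it is likely to be much slower than a grouped approach; it is shown only as the literal map() equivalent of the loop.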
Source: https://stackoverflow.com/questions/53536984/finding-duplicate-observations-of-selected-variables-in-a-tibble