Question
I have a dataframe with a column of tibbles. Here is a portion of my data:
date time uuid data
2018-06-23 18:25:24 0b27ea5fad61c99d <tibble>
2018-06-23 18:25:38 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:01 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:23 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:37 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:00 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:22 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:39 0b27ea5fad61c99d <tibble>
2018-06-23 18:28:06 0b27ea5fad61c99d <tibble>
2018-06-23 18:28:30 0b27ea5fad61c99d <tibble>
And this is my function:
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) /
           length(union(vector1, vector2)))
}
My data column consists of tibbles with one column of characters:
contacts
5646
65748
115
498456
35135
My goal is to calculate jaccard between each 2 consecutive tibbles in the data column.
I have tried:
df %>% mutate(j = jaccard(data, lag(data, 1)))
but it doesn't seem to work for some reason.
I know I am close, please advise.
Answer 1:
The reason is that the jaccard function is not written to handle vector arguments. A function used inside mutate receives the whole column at once (a list of 10 elements in the OP's example), not one row at a time. Because jaccard treats its two arguments as two single sets, it computes one intersection and one union over the entire lists and returns a single number, so the result is not the row-by-row value the OP expects.
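To make the difference concrete, here is a minimal base R sketch. The three-element list `data` and the shifted list `prev` are hypothetical stand-ins for the list column and its lag, not the OP's real data.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical stand-in for the list column: one vector of contacts per row
data <- list(c(5646, 65748, 115), c(5646, 65748), c(5646))

# Previous row's contacts; the first row has no predecessor, so it gets NULL
prev <- c(list(NULL), data[-length(data)])

# Passing the whole lists compares them as two big sets of list elements
whole <- jaccard(data, prev)
length(whole)  # a single number, not one value per row

# Applying jaccard pair by pair gives one value per row: 0, 2/3, 1/2
pairwise <- mapply(jaccard, data, prev)
pairwise
```

The pairwise call is what a vectorised version of jaccard does for each row.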
The easiest fix is to vectorise the jaccard function so that it is applied element by element to its arguments. One can use Vectorize to convert the function:
library(dplyr)
# Function
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) /
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function: applies jaccard to each pair of elements
jaccardV <- Vectorize(jaccard)
# Compare each row's data with the previous row's data
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))
# date time uuid data j
# 1 2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2 2018-06-23 18:25:38 0b27ea5fad61c99d 5646, 65748 0.4000000
# 3 2018-06-23 18:26:01 0b27ea5fad61c99d 5646, 65748, 115 0.6666667
# 4 2018-06-23 18:26:23 0b27ea5fad61c99d 5646 0.3333333
# 5 2018-06-23 18:26:37 0b27ea5fad61c99d 5646, 65748 0.5000000
# 6 2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7 2018-06-23 18:27:22 0b27ea5fad61c99d 5646, 65748 0.4000000
# 8 2018-06-23 18:27:39 0b27ea5fad61c99d 5646, 65748, 115 0.6666667
# 9 2018-06-23 18:28:06 0b27ea5fad61c99d 5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d 5646, 65748 0.5000000
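Note that the sample data used in this answer stores bare vectors in the data column, whereas the question describes one-column tibbles. Since length() of a data frame counts columns rather than rows, the set sizes would then need to come from the contacts column itself. A minimal sketch under that assumption follows; `jaccard_contacts` is a hypothetical helper, and plain data frames stand in for the nested tibbles so the example needs only base R.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical helper: compare the `contacts` columns of two nested tables
jaccard_contacts <- function(current, previous) {
  # The first row has nothing to compare with
  if (is.null(previous)) return(NA_real_)
  jaccard(current$contacts, previous$contacts)
}

# Plain data frames as stand-ins for the nested tibbles
nested <- list(
  data.frame(contacts = c("5646", "65748", "115", "498456", "35135"),
             stringsAsFactors = FALSE),
  data.frame(contacts = c("5646", "65748"), stringsAsFactors = FALSE),
  data.frame(contacts = c("5646", "65748", "115"), stringsAsFactors = FALSE)
)

# Shift the list down by one so each table is paired with its predecessor
previous <- c(list(NULL), nested[-length(nested)])

# One Jaccard value per row; the first is NA by construction
j <- mapply(jaccard_contacts, nested, previous)
j
```

Returning NA for the first row, rather than 0, is a design choice made here to mark "no previous row" as missing; the output above reports 0 for that row instead.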
Data:
df <- read.table(text="
date time uuid
2018-06-23 18:25:24 0b27ea5fad61c99d
2018-06-23 18:25:38 0b27ea5fad61c99d
2018-06-23 18:26:01 0b27ea5fad61c99d
2018-06-23 18:26:23 0b27ea5fad61c99d
2018-06-23 18:26:37 0b27ea5fad61c99d
2018-06-23 18:27:00 0b27ea5fad61c99d
2018-06-23 18:27:22 0b27ea5fad61c99d
2018-06-23 18:27:39 0b27ea5fad61c99d
2018-06-23 18:28:06 0b27ea5fad61c99d
2018-06-23 18:28:30 0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)
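library(dplyr)  # provides tibble()
t1 <- tibble(contacts = c(5646, 65748, 115, 498456, 35135))
t2 <- tibble(contacts = c(5646, 65748))
t3 <- tibble(contacts = c(5646, 65748, 115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646, 65748))
# c() flattens the tibbles into a list of 5 vectors, which is recycled to fill all 10 rows
df$data <- c(t1, t2, t3, t4, t5)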
df
# date time uuid data
# 1 2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2 2018-06-23 18:25:38 0b27ea5fad61c99d 5646, 65748
# 3 2018-06-23 18:26:01 0b27ea5fad61c99d 5646, 65748, 115
# 4 2018-06-23 18:26:23 0b27ea5fad61c99d 5646
# 5 2018-06-23 18:26:37 0b27ea5fad61c99d 5646, 65748
# 6 2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7 2018-06-23 18:27:22 0b27ea5fad61c99d 5646, 65748
# 8 2018-06-23 18:27:39 0b27ea5fad61c99d 5646, 65748, 115
# 9 2018-06-23 18:28:06 0b27ea5fad61c99d 5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d 5646, 65748
Source: https://stackoverflow.com/questions/51008774/calculate-function-on-a-column-of-nested-tibbles