Calculate function on a column of nested tibbles?

问题

I have a dataframe with a column of tibbles. Here is a portion of my data:

date        time        uuid                data
2018-06-23  18:25:24    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:25:38    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:01    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:23    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:37    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:00    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:22    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:39    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:06    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:30    0b27ea5fad61c99d    <tibble>

And this is my function:

jaccard <- function(vector1, vector2) {

  return(length(intersect(vector1, vector2)) / 
        length(union(vector1, vector2)))

}

My data column consists of tibbles with one column of characters:

contacts
5646
65748
115
498456
35135

My goal is to calculate jaccard between each 2 consecutive tibbles in the data column.

I have tried:

df %>% mutate(j = jaccard(data, lag(data, 1))) but it doesn't seem to work for some reason.

I know I am close, please advise.

回答1:

The reason is that jaccard function is not written to handle vector arguments. As you know that functions used as part of mutate receive a vector of data (vector of 10 tibbles in case of OP's example). Now, since jaccard function is not written to handle arguments of vector(vector of tibbles) the result will not meet expectation.

The easiest fix can be to vectorise jaccard function so that it can handle vector arguments. Once can use Vectorize to convert the function as:

# Function 
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) / 
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function
jaccardV <- Vectorize(jaccard)


library(dplyr)
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))

#          date     time             uuid                            data         j
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646 0.3333333
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748 0.5000000
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748 0.5000000

Data:

df <- read.table(text="
date        time        uuid                
2018-06-23  18:25:24    0b27ea5fad61c99d    
2018-06-23  18:25:38    0b27ea5fad61c99d    
2018-06-23  18:26:01    0b27ea5fad61c99d    
2018-06-23  18:26:23    0b27ea5fad61c99d    
2018-06-23  18:26:37    0b27ea5fad61c99d    
2018-06-23  18:27:00    0b27ea5fad61c99d    
2018-06-23  18:27:22    0b27ea5fad61c99d    
2018-06-23  18:27:39    0b27ea5fad61c99d    
2018-06-23  18:28:06    0b27ea5fad61c99d    
2018-06-23  18:28:30    0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)

t1 <- tibble(contacts = c(5646,65748,115,498456,35135))
t2 <- tibble(contacts = c(5646,65748))
t3 <- tibble(contacts = c(5646,65748,115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646,65748))


df$data <- c(t1,t2,t3,t4,t5)

df
#          date     time             uuid                            data
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748

来源：https://stackoverflow.com/questions/51008774/calculate-function-on-a-column-of-nested-tibbles

标签

dataframe

tibble