Calculate function on a column of nested tibbles?

Submitted by 这一生的挚爱 on 2019-12-11 06:24:44

Question


I have a dataframe with a column of tibbles. Here is a portion of my data:

date        time        uuid                data
2018-06-23  18:25:24    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:25:38    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:01    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:23    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:37    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:00    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:22    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:39    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:06    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:30    0b27ea5fad61c99d    <tibble>

And this is my function:

jaccard <- function(vector1, vector2) {

  return(length(intersect(vector1, vector2)) / 
        length(union(vector1, vector2)))

}

My data column consists of tibbles with one column of characters:

contacts
5646
65748
115
498456
35135

My goal is to calculate the Jaccard index between each pair of consecutive tibbles in the data column.

I have tried:

df %>% mutate(j = jaccard(data, lag(data, 1)))

but it doesn't seem to work for some reason.

I know I am close, please advise.


Answer 1:


The reason is that the jaccard function is not written to handle vector arguments. Functions used inside mutate receive the whole column as a vector (a vector of 10 tibbles in the OP's example), not one element at a time. Since jaccard treats each argument as a single set, it compares the two columns as a whole instead of comparing them row by row, so the result does not meet expectations.
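The effect can be reproduced without dplyr. The following is a minimal sketch in base R, assuming three short contact sets and a hand-built lag; the names `data` and `prev` and the empty placeholder for the first row are illustrative stand-ins, not the OP's actual objects.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Three contact sets standing in for the rows of the data column
data <- list(c(5646, 65748, 115), c(5646, 65748), c(5646))

# Hand-built lag: shift by one row, with an empty set for the first row
prev <- c(list(numeric(0)), data[-length(data)])

# Called on the whole columns, jaccard compares the two lists as sets
# whose elements are entire contact vectors
whole <- jaccard(data, prev)
length(whole)  # one value for the whole column, not one value per row
```

This is why the call inside mutate does not produce a per-row score.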

The easiest fix is to vectorise the jaccard function so that it is applied element by element. One can use Vectorize to convert the function:

# Function 
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) / 
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function
jaccardV <- Vectorize(jaccard)


library(dplyr)
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))

#          date     time             uuid                            data         j
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646 0.3333333
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748 0.5000000
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748 0.5000000
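Vectorize is a wrapper around mapply, so the same element-wise pairing can also be written directly in base R. The following is a sketch under the assumption that the contact sets are held in a plain list and the lag is built by hand; `contacts` and `prev` are illustrative names, and the empty set for the first row is a stand-in for what lag supplies.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# The five contact sets from the example data
contacts <- list(
  c(5646, 65748, 115, 498456, 35135),
  c(5646, 65748),
  c(5646, 65748, 115),
  c(5646),
  c(5646, 65748)
)

# Previous row's set for each row; the first row has no predecessor
prev <- c(list(numeric(0)), contacts[-length(contacts)])

# mapply calls jaccard once per pair (contacts[[i]], prev[[i]])
j <- mapply(jaccard, contacts, prev)
j
# 0.0000000 0.4000000 0.6666667 0.3333333 0.5000000
```

The first score is 0 only because the first row is compared with an empty set; replacing it with NA afterwards may be preferable if a missing predecessor should not count as zero overlap.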

Data:

df <- read.table(text="
date        time        uuid                
2018-06-23  18:25:24    0b27ea5fad61c99d    
2018-06-23  18:25:38    0b27ea5fad61c99d    
2018-06-23  18:26:01    0b27ea5fad61c99d    
2018-06-23  18:26:23    0b27ea5fad61c99d    
2018-06-23  18:26:37    0b27ea5fad61c99d    
2018-06-23  18:27:00    0b27ea5fad61c99d    
2018-06-23  18:27:22    0b27ea5fad61c99d    
2018-06-23  18:27:39    0b27ea5fad61c99d    
2018-06-23  18:28:06    0b27ea5fad61c99d    
2018-06-23  18:28:30    0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)

library(dplyr)  # re-exports tibble(); needed if this block is run first
t1 <- tibble(contacts = c(5646,65748,115,498456,35135))
t2 <- tibble(contacts = c(5646,65748))
t3 <- tibble(contacts = c(5646,65748,115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646,65748))


# The 5 elements are recycled across the 10 rows
df$data <- c(t1,t2,t3,t4,t5)

df
#          date     time             uuid                            data
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748
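Note that c(t1,t2,t3,t4,t5) appears to flatten the tibbles, so the data column above holds plain numeric vectors rather than tibbles. If the column really contains one-column tibbles, as in the question, it is probably safer to pull out the contacts column before comparing, because length() of a data frame counts its columns, not its rows. The sketch below uses base data frames as stand-ins for the tibbles; `nested`, `contacts` and `prev` are illustrative names.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# One-column data frames standing in for the nested tibbles
nested <- list(
  data.frame(contacts = c(5646, 65748, 115, 498456, 35135)),
  data.frame(contacts = c(5646, 65748)),
  data.frame(contacts = c(5646, 65748, 115))
)

# length() of a data frame is its number of columns
length(nested[[1]])  # 1

# Extract the contacts vector from each nested table first
contacts <- lapply(nested, function(d) d$contacts)

# Previous row's contacts; empty set for the first row
prev <- c(list(numeric(0)), contacts[-length(contacts)])

# Element-wise Jaccard index between consecutive rows
j <- mapply(jaccard, contacts, prev)
```

Here j holds one score per row, computed on the contact values themselves rather than on the table objects.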


Source: https://stackoverflow.com/questions/51008774/calculate-function-on-a-column-of-nested-tibbles
