Grouping a many-to-many relationship from a two-column map

后端 未结 3 1552
无人及你
无人及你 2021-01-13 16:45

I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) toge

相关标签:
3条回答
  • 2021-01-13 17:26

    Here's a go re-hashing my answer to an old question of mine that Josh O'Brien linked in the comments ( identify groups of linked episodes which chain together ). This answer uses the igraph library.

    # Dummy data that might be easier to interpret to show it worked
    # Authors 1,2 and 3,4 should group. author 5 is a group to themselves
    aubk <- data.frame(author_id=c(1,2,3,4,5),book_id=c(1,1,2,2,5))
    
    # identify authors with a bit of leading text to prevent clashes 
    # with the book ids
    aubk$author_id2 <- paste0("au",aubk$author_id)
    
    library(igraph)
    #create a graph - this needs to be matrix input
    au_graph <- graph.edgelist(as.matrix(aubk[c("author_id2","book_id")]))
    # get the ids of the authors
    result <- data.frame(author_id=names(au_graph[1]),stringsAsFactors=FALSE)
    # get the corresponding group membership of the authors
    result$group <- clusters(au_graph)$membership
    
    # subset to only the authors data
    result <- result[substr(result$author_id,1,2)=="au",]
    # make the author_id variable numeric again
    result$author_id <- as.numeric(substr(result$author_id,3,nchar(result$author_id)))
    
    > result
      author_id group
    1         1     1
    3         2     1
    4         3     2
    6         4     2
    7         5     3
    
    0 讨论(0)
  • 2021-01-13 17:36

    A couple of suggestions

    aubk[,list(author_list = list(sort(author_id))), by = book_id]
    

    will give a list of author groups

    The followingwill create a unique identifier for each group of authors and then return a list with

    • the number of books
    • A list of the book ids
    • A unique identifier of the book_ids
    • number of authors

    for each unique group of authors

    aubk[, list(author_list = list(sort(author_id)), 
                group_id = paste0(sort(author_id), collapse=','), 
                n_authors = .N),by =  book_id][,
            list(n_books = .N, 
                 n_authors = unique(n_authors), 
                 book_list = list(book_id), 
                 book_ids = paste0(book_id, collapse = ', ')) ,by = group_id]
    

    If the author order matters, just remove the sort with the definitions of author_list and group_id

    EDIT

    noting that the above, while useful does not do the appropriate grouping

    Perhaps the following will

    # the unique groups of authors by book
    unique_authors <- aubk[, list(sort(author_id)), by = book_id]
    # some helper functions
    # a filter function that allows arguments to be passed
    .Filter <- function (f, x,...) 
    {
      ind <- as.logical(sapply(x, f,...))
      x[!is.na(ind) & ind]
    }
    
    # any(x in y)?
    `%%in%%` <- function(x,table){any(unlist(x) %in% table)}
    # function to filter a list and return the unique elements from 
    # flattened values
    FilterList <- function(.list, table) {
      unique(unlist(.Filter(`%%in%%`, .list, table =table)))
    }
    
    # all the authors
    all_authors <- unique(unlist(unique_authors))
    # with names!
    setattr(all_authors, 'names', all_authors)
    # get for each author, the authors with whom they have
    # collaborated in at least 1 book
    lapply(all_authors, FilterList, .list = unique_authors)
    
    0 讨论(0)
  • 2021-01-13 17:42

    Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package isn't updated for R version 2.15.1, so that was out as well.

    After writing a lot of dumb code that doesn't seem to work, I think the following gets me to the right answer.

    aubk[,grp := author_id]
    num.grp.old <- aubk[,length(unique(grp))]
    iterations <- 0
    repeat {
        aubk[,grp := min(grp),by=author_id]
        aubk[,grp := min(grp), by=book_id]
        num.grp.new <- aubk[,length(unique(grp))] 
        if(num.grp.new == num.grp.old) {break}
        num.grp.old <- num.grp.new
        iterations <- iterations + 1
    }
    
    0 讨论(0)
提交回复
热议问题