How to get counts of intersections of six or more sets?

谎友^ 2021-01-28 02:09

I am running an analysis of a number of sets and I have been using the package VennDiagram, which has been working just fine, but it only handles up to 5 sets, and now it turns

3 answers
  • 2021-01-28 02:51

    Here's one way, assuming you represent the sets as a list of vectors and the items to search for in those sets as a vector:

    # Example data format
    sets <- list(v1 = 1:6, v2 = 1:8, v3 = 3:8)
    items <- c(2:7)
    
    # Search for items in each set
    result <- data.frame(searched = items)
    for (set in names(sets)) {
      result <- cbind(result, items %in% sets[[set]])
      names(result)[length(names(result))] <- set
    }
    
    # Count
    library(plyr)
    ddply(result, names(sets), function (i) {
      data.frame(count = nrow(i))
    })
    

    This gives a count for every combination of set memberships that actually occurs among the searched items:

         v1   v2    v3 count
    1 FALSE TRUE  TRUE     1
    2  TRUE TRUE FALSE     1
    3  TRUE TRUE  TRUE     4
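
    The same tabulation can be sketched in base R, without plyr. This is only a sketch: it assumes the items of interest are the union of all sets, and the names `membership` and `pattern_counts` are illustrative rather than part of the answer above.

    ```r
    # Same example sets as above
    sets <- list(v1 = 1:6, v2 = 1:8, v3 = 3:8)

    # Assumption: look up every item that occurs in at least one set
    items <- sort(unique(unlist(sets)))

    # Logical matrix with one row per item and one column per set
    membership <- vapply(sets, function(s) items %in% s, logical(length(items)))

    # Count how many items share each TRUE/FALSE membership pattern
    pattern_counts <- aggregate(count ~ .,
                                data = data.frame(membership, count = 1L),
                                FUN = sum)
    pattern_counts
    ```

    Because every item falls into exactly one membership pattern, the counts add up to the number of distinct items, and nothing in the code depends on the number of sets.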
    
  • 2021-01-28 02:55

    Here is a recursive solution that finds all of the intersections in the Venn diagram. `sets` can be a list containing any number of sets to intersect. For some reason, the code in the package you are using is hard-coded for each set size, so it doesn't scale to an arbitrary number of sets.

    ## Build intersections, 'out' accumulates the result
    intersects <- function(sets, out=NULL) {
        if (length(sets) < 2) return ( out )                               # return result
        len <- seq(length(sets))
        if (missing(out)) out <- list()                                    # initialize accumulator
        for (idx in split((inds <- combn(length(sets), 2)), col(inds))) {  # 2-way combinations
            ii <- len > idx[2] & !(len %in% idx)                           # indices to keep for next intersect
            out[[(n <- paste(names(sets[idx]), collapse="."))]] <- intersect(sets[[idx[1]]], sets[[idx[2]]])
            out <- intersects(append(out[n], sets[ii]), out=out)
        }
        out
    }
    

    The function builds pairwise intersections. To avoid building repeated solutions, it only calls itself on the elements of the list whose indices are greater than those that were just joined (`ii` in the code). The result is a list of all the intersections. If you pass named elements, the result is named by the convention "set1.set2", and so on.
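
    For comparison, the same list of intersections can be sketched without recursion by enumerating the combinations with `combn` and folding `intersect` over each one. The helper name `all_intersections` and the six example sets are assumptions made for illustration; the naming convention ("a.b", "a.b.c", ...) follows the function above.

    ```r
    # Hypothetical helper: intersect every combination of two or more sets.
    # Assumes `sets` is a named list of vectors.
    all_intersections <- function(sets) {
      out <- list()
      for (k in seq_len(length(sets))[-1]) {              # sizes 2, 3, ..., n
        for (combo in combn(names(sets), k, simplify = FALSE)) {
          out[[paste(combo, collapse = ".")]] <- Reduce(intersect, sets[combo])
        }
      }
      out
    }

    # Six small sets, to show that nothing is tied to a fixed set count
    six <- list(a = 1:10, b = c(1, 5, 10), c = 9:20,
                d = 5:15, e = c(2, 10), f = 10:12)
    res6 <- all_intersections(six)

    ## Counts of intersections
    lengths(res6)
    ```

    With n sets this produces 2^n - n - 1 entries, so it stays practical for six or so sets but grows quickly beyond that.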

    Results

    ## Some sample data
    set.seed(0)
    sets <- setNames(lapply(1:3, function(.) sample(letters, 10)), letters[1:3])
    
    ## Manually check intersections
    a.b <- intersect(sets[[1]], sets[[2]])
    b.c <- intersect(sets[[2]], sets[[3]])
    a.c <- intersect(sets[[1]], sets[[3]])
    a.b.c <- intersect(a.b, sets[[3]])
    
    ## Compare
    res <- intersects(sets)
    all.equal(res[c("a.b","a.c","b.c","a.b.c")], list(a.b=a.b, a.c=a.c, b.c=b.c, a.b.c=a.b.c))
    # TRUE
    
    res
    # $a.b
    # [1] "g" "i" "n" "e" "r"
    # 
    # $a.b.c
    # [1] "g"
    # 
    # $a.c
    # [1] "x" "g"
    # 
    # $b.c
    # [1] "f" "g"
    
    ## Get the counts of intersections
    lengths(res)
    # a.b a.b.c   a.c   b.c 
    #   5     1     2     2 
    

    Or, with numbers

    intersects(list(a=1:10, b=c(1, 5, 10), c=9:20))
    # $a.b
    # [1]  1  5 10
    # $a.b.c
    # [1] 10
    # $a.c
    # [1]  9 10
    # $b.c
    # [1] 10
    
  • 2021-01-28 03:00

    Here's an attempt:

    list1 <- c("a","b","c","e")
    list2 <- c("a","b","c","e")
    list3 <- c("a","b")
    list4 <- c("a","b","g","h")
    list_names <- c("list1","list2","list3","list4")
    
    lapply(seq_along(list_names), function(y) {
        # All combinations of y list names, kept as a list of character vectors
        combinations <- combn(list_names, y, simplify = FALSE)
        res <- lapply(combinations, function(x) {
            # Elements common to every list in the current combination
            p <- Reduce(intersect, lapply(x, get))
            # Drop elements that also occur in any list outside the combination
            others <- setdiff(list_names, x)
            if (length(others) > 0) p <- setdiff(p, unlist(lapply(others, get)))
            # Use NA to mark an empty region
            if (length(p) > 0) p else NA
        })
        names(res) <- vapply(combinations, paste, character(1), collapse = "-")
        res
    })
    

    The first lapply is used to loop from 1 to the number of sets you have. Then I took all possible combinations of list names, taken y at a time. This essentially generates all of the different subareas in the Venn diagram.

    For each combination, the output is the set difference between the intersection of the lists in the current combination and the union of the lists that are not in the combination.

    The final result is a list whose length equals the number of input sets. Its first element holds the elements unique to each single list, its second element holds the elements found only in each combination of two lists, and so on.
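
    If only the counts per region are needed, the same idea can be sketched as a function that takes a named list instead of looking variables up with `get`. The helper name `exclusive_counts` is hypothetical, and the example data simply repeats the four lists above.

    ```r
    # Hypothetical helper: for each combination of sets, count the elements that
    # belong to every set in the combination and to no set outside it.
    # Assumes `sets` is a named list of vectors.
    exclusive_counts <- function(sets) {
      counts <- integer(0)
      for (k in seq_along(sets)) {
        for (combo in combn(names(sets), k, simplify = FALSE)) {
          inside  <- Reduce(intersect, sets[combo])
          outside <- unlist(sets[setdiff(names(sets), combo)], use.names = FALSE)
          region  <- unique(inside[!inside %in% outside])
          counts[paste(combo, collapse = "-")] <- length(region)
        }
      }
      counts
    }

    sets <- list(list1 = c("a", "b", "c", "e"), list2 = c("a", "b", "c", "e"),
                 list3 = c("a", "b"),           list4 = c("a", "b", "g", "h"))
    region_counts <- exclusive_counts(sets)

    # Show only the non-empty regions of the diagram
    region_counts[region_counts > 0]
    ```

    Since each element lands in exactly one region, the counts sum to the number of distinct elements across all sets, which makes a convenient sanity check.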
