Can an array be grouped more efficiently than sorted?

青春惊慌失措 2020-12-06 06:44

While working on example code for an algorithm question, I came across the situation where I was sorting an input array, even though I only needed to have identical elements

4 answers
  • 2020-12-06 07:24

    Any sorting algorithm, even the most efficient ones, will require you to traverse the array multiple times. Grouping, on the other hand, can be done in exactly one iteration, or two, depending on how you insist your result be formatted:

    groups = {}
    for i in arr:
        if i not in groups:
            groups[i] = []
        groups[i].append(i)
    

    This is an extremely primitive loop ignoring many of the optimisations and idioms probably available in your language of choice, but results in this after just one iteration:

    {1: [1, 1], 2: [2, 2], 3: [3], 4: [4, 4]}
    

    If you have complex objects, you can choose any arbitrary attribute to group by as the dictionary key, so this is a very generic algorithm.
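
    As a minimal sketch of grouping by an attribute, the loop above could be generalised with a key function. The helper name `group_by` and the sample `people` records are made up for illustration; they are not part of the original answer.

```python
from collections import defaultdict

def group_by(items, key):
    """Group items by key(item) in a single pass; returns {key: [items]}."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

# Hypothetical records: (name, department) tuples.
people = [("Ann", "ops"), ("Bob", "dev"), ("Cy", "ops"), ("Di", "dev")]
by_dept = group_by(people, key=lambda p: p[1])
print(by_dept)
```

    The key only has to be hashable; the grouped objects themselves do not.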

    If you insist on your result being a flat list, you can achieve that easily:

    result = []
    for l in groups.values():  # iterate over the lists, not the keys
        result += l
    

    (Again, ignoring specific language optimisations and idioms.)

    So there you have a linear-time solution requiring at most one full iteration of the input and one smaller iteration of the intermediate grouping data structure. The space requirements depend on the specifics of the language, but are typically O(n) for the grouped elements plus whatever overhead the dictionary and list data structures incur.

  • 2020-12-06 07:28

    Yes, all you need to do is create a dictionary and count how many elements of each type you have. After that, just iterate over the keys in that dictionary and output each key as many times as its value says.

    Quick Python implementation:

    from collections import Counter
    arr = [1, 2, 4, 1, 4, 3, 2]
    cnt, grouped = Counter(arr), []  # Counter creates a dictionary that counts the occurrences of each element
    for k, v in cnt.items():
        grouped += [k] * v  # [k] * v creates a list of length v whose elements all equal k
    
    print(grouped)
    

    This groups all the elements in O(n) time, using potentially O(n) additional space. In terms of time complexity, that is more efficient than sorting, which takes O(n log n) time but can be done in place.
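
    For what it's worth, the same counting idea can be written more compactly with `Counter.elements()`, which repeats each key as many times as its count. This is only a sketch and assumes Python 3.7 or later, where a `Counter` keeps keys in first-seen order.

```python
from collections import Counter

arr = [1, 2, 4, 1, 4, 3, 2]
# elements() yields each key `count` times, in the order keys were first seen.
grouped = list(Counter(arr).elements())
print(grouped)  # [1, 1, 2, 2, 4, 4, 3]
```

    Like the loop above, this only works when equal elements are interchangeable, because the output is rebuilt from the keys rather than from the original objects.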

  • 2020-12-06 07:43

    Since you asked about comparison-based methods, I'm going to make the usual assumptions that (1) elements can be compared but not hashed, and (2) the only resource of interest is three-way comparisons.

    In an absolute sense, it's easier to group than to sort. Here's a grouping algorithm for three elements that uses one comparison (sorting requires three). Given an input x, y, z, if x = y, then return x, y, z. Otherwise, return x, z, y.
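
    The three-element algorithm is small enough to check exhaustively. In the sketch below, `group3` follows the description above, while `is_grouped` is a hypothetical checker added for illustration (it uses a set, which the comparison-only model would not allow, but it is not part of the algorithm).

```python
from itertools import product

def group3(x, y, z):
    """Group three values using a single comparison of x and y."""
    if x == y:
        return [x, y, z]
    return [x, z, y]

def is_grouped(seq):
    """True if equal elements of seq occupy one contiguous run."""
    seen = set()      # values whose run has already started
    prev = object()   # sentinel that equals nothing in seq
    for v in seq:
        if v != prev:
            if v in seen:
                return False  # v reappears after its run ended
            seen.add(v)
            prev = v
    return True

# Three distinct values are enough to cover every equality pattern.
assert all(is_grouped(group3(*t)) for t in product(range(3), repeat=3))
```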

    Asymptotically, however, both grouping and sorting require Omega(n log n) comparisons. The lower bound technique is information-theoretic: we prove that, for every grouping algorithm expressed as a decision tree, there are 3^Omega(n log n) leaves, which implies that the height of the tree (and hence the worst-case running time of the algorithm) is Omega(n log n).

    Fix an arbitrary leaf of the decision tree where no input elements are found to be equal. The input positions are partially ordered by the inequalities found.

    We claim that this partial order has no antichain of cardinality three. Suppose to the contrary that i, j, k are pairwise incomparable input positions. Letting x = input[i], y = input[j], z = input[k], the possibilities x = y < z and y = z < x and z = x < y are all consistent with what the algorithm has observed. This cannot be, since it is impossible for the one order chosen by the leaf to put x next to y next to z next to x. We conclude that the partial order has no antichain of cardinality three.

    By Dilworth's theorem, the partial order therefore has two chains that cover the whole input. Let m be the length of one of these chains. By considering all possible ways to merge the two chains into a total order, there are at most n choose m ≤ 2^n permutations that map to each leaf. The number of leaves is thus at least n!/2^n = 3^Omega(n log n).
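
    To spell out the last step: the tree is ternary, so its height is at least the base-3 logarithm of the number of leaves. Using the standard bound n! ≥ (n/e)^n (a fact not stated in the answer itself), the arithmetic can be sketched as:

```latex
\log_3 \frac{n!}{2^n}
  \;\ge\; \log_3 \frac{(n/e)^n}{2^n}
  \;=\; n \log_3 \frac{n}{2e}
  \;=\; \Omega(n \log n).
```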

  • 2020-12-06 07:46

    How about using a 2-dimensional array whose first column is the frequency of each value and whose second column is the value itself? We can take advantage of the Boolean data type and indexing. As a side effect, the output comes out sorted by value, and the input is looped over exactly once, giving an O(n) solution as long as the values are positive integers whose maximum is O(n) (the helper arrays have length max(arr.in)). I think this approach will translate well to other languages. Observe the following base R code (N.B. there are far more efficient ways in R than the below; I'm simply giving a more general approach).

    GroupArray <- function(arr.in) {
    
        maxVal <- max(arr.in)
    
        arr.out.val <- rep(FALSE, maxVal)  ## F, F, F, F, ...
        arr.out.freq <- rep(0L, maxVal)     ## 0, 0, 0, 0, ... 
    
        for (i in arr.in) {
            arr.out.freq[i] <- arr.out.freq[i]+1L
            arr.out.val[i] <- TRUE
        }
    
        myvals <- which(arr.out.val)   ## "which" returns the TRUE indices
    
        array(c(arr.out.freq[myvals],myvals), dim = c(length(myvals), 2), dimnames = list(NULL,c("freq","vals")))
    }
    

    Small example of the above code:

    set.seed(11)
    arr1 <- sample(10, 10, replace = TRUE)
    
    arr1                                    
    [1]  3  1  6  1  1 10  1  3  9  2     ## unsorted array
    
    GroupArray(arr1)    
         freq vals       ## Nicely sorted with the frequency
    [1,]    4    1
    [2,]    1    2
    [3,]    2    3
    [4,]    1    6
    [5,]    1    9
    [6,]    1   10
    

    Larger example:

    set.seed(101)
    arr2 <- sample(10^6, 10^6, replace = TRUE)
    
    arr2[1:10]       ## First 10 elements of random unsorted array
    [1] 372199  43825 709685 657691 249856 300055 584867 333468 622012 545829
    
    arr2[999990:10^6]     ## Last 11 elements of random unsorted array
    [1] 999555 468102 851922 244806 192171 188883 821262 603864  63230  29893 664059
    
    t2 <- GroupArray(arr2)
    head(t2)
         freq vals        ## Nicely sorted with the frequency
    [1,]    2    1
    [2,]    2    2
    [3,]    2    3
    [4,]    2    6
    [5,]    2    8
    [6,]    1    9
    
    tail(t2)
              freq    vals 
    [632188,]    3  999989
    [632189,]    1  999991
    [632190,]    1  999994
    [632191,]    2  999997
    [632192,]    2  999999
    [632193,]    2 1000000
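
    As a rough illustration of how the approach might carry over to another language, here is a Python sketch of the same counting idea. The name `group_array` is made up for this example, and, like the R code, it assumes every element is a positive integer.

```python
def group_array(arr):
    """Return (value, frequency) pairs in ascending value order.

    Assumes every element is an integer in 1..max(arr), as in the R code.
    """
    max_val = max(arr)
    freq = [0] * (max_val + 1)  # index 0 is unused
    for v in arr:
        freq[v] += 1            # one pass over the input
    # One pass over the counting array; cost grows with max_val, not len(arr).
    return [(v, c) for v, c in enumerate(freq) if c > 0]

arr1 = [3, 1, 6, 1, 1, 10, 1, 3, 9, 2]
print(group_array(arr1))
# [(1, 4), (2, 1), (3, 2), (6, 1), (9, 1), (10, 1)]
```

    The total work is O(n + max(arr)), so this only beats a dictionary-based grouping when the values are small relative to the input size.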
    