F#: removing duplicates from a seq is slow

半阙折子戏 2021-01-12 03:11

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist:

9 answers
  •  时光说笑
    2021-01-12 03:56

    The performance issue comes from the nested calls to Seq.tail. Here's the source code for Seq.tail:

    [<CompiledName("Tail")>]
    let tail (source: seq<'T>) =
        checkNonNull "source" source
        seq { use e = source.GetEnumerator() 
              if not (e.MoveNext()) then 
                  invalidArg "source" (SR.GetString(SR.notEnoughElements))
              while e.MoveNext() do
                  yield e.Current }
    

    If you call Seq.tail(Seq.tail(Seq.tail(...))), each call wraps the previous sequence in a new one with its own enumerator from GetEnumerator(), and the compiler has no way of optimizing these enumerators away. Every returned element then has to bubble up through all previously created sequences, and each of those sequences has to advance its own internal state as well. The net result is a running time of O(n^2) instead of linear O(n).
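
    To make the cost visible, here is a small sketch that counts how often the underlying source is asked for an element. The names `pulls`, `counted` and `sumViaTail` are hypothetical helpers invented for this illustration, and the recursive sum is only a stand-in for any head/tail-style recursion over a seq.

```fsharp
// Hypothetical counter: incremented every time the source produces an element.
let pulls = ref 0

// A sequence of 1..n that records each element pulled from it.
let counted n =
    seq {
        for i in 1 .. n do
            pulls.Value <- pulls.Value + 1
            yield i
    }

// Recursive sum in head/tail style; every level wraps its input in another Seq.tail.
let rec sumViaTail (s: seq<int>) =
    if Seq.isEmpty s then 0
    else Seq.head s + sumViaTail (Seq.tail s)

let n = 100

pulls.Value <- 0
let nestedTotal = sumViaTail (counted n)
let nestedPulls = pulls.Value

pulls.Value <- 0
let directTotal = Seq.sum (counted n)
let directPulls = pulls.Value

printfn "nested Seq.tail: total=%d pulls=%d" nestedTotal nestedPulls
printfn "single pass:     total=%d pulls=%d" directTotal directPulls
```

    Both approaches compute the same total, but the nested version should report a pull count on the order of n^2, while the single pass pulls each element exactly once.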

    Unfortunately, there is currently no way to represent this head/tail decomposition efficiently in a functional style for sequences in F#. You can do it with a list (x::xs), but not with a sequence. Until the language gets better native support for sequences, don't use Seq.tail in recursive functions.
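
    For comparison, this is roughly what the head/tail style looks like when the input is a list. It is a minimal sketch, not code from the question; the name `removeDuplicatesKeepLastList` is made up for this example, and the function is not tail-recursive, so very long lists could exhaust the stack.

```fsharp
// Keeps the last element of each run of consecutive "equal" items in a list.
let rec removeDuplicatesKeepLastList equals items =
    match items with
    // x is followed by an equal element: drop x and continue from its successor.
    | x :: ((y :: _) as rest) when equals x y -> removeDuplicatesKeepLastList equals rest
    // x is the last of its run (or the last element overall): keep it.
    | x :: rest -> x :: removeDuplicatesKeepLastList equals rest
    | [] -> []

let sample = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]

removeDuplicatesKeepLastList (fun a b -> fst a = fst b) sample
|> printfn "%A"

// output
// [("a", 1); ("b", 4); ("c", 5)]
```

    Pattern matching on a list is cheap because taking the tail of a list does not allocate a new enumerator; it just follows a pointer to the rest of the list.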

    Using a single enumerator will fix the performance problem.

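    let RemoveDuplicatesKeepLast equals (items:seq<_>) =
        seq {
            // One enumerator for the whole traversal; disposed when the seq is done.
            use e = items.GetEnumerator()

            if e.MoveNext() then
                // Candidate for output: the latest element of the current run.
                let mutable previous = e.Current

                while e.MoveNext() do
                    // A new run starts, so the previous run's last element is final.
                    if not (equals e.Current previous) then
                        yield previous
                    previous <- e.Current

                // Emit the last element of the final run.
                yield previous
        }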
    
    let test = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    let FirstEqual a b = fst a = fst b
    
    RemoveDuplicatesKeepLast FirstEqual test
    |> printf "%A"
    
    // output
    // seq [("a", 1); ("b", 4); ("c", 5)]
    

    The first revision of this answer contains a recursive version of the above code without mutation.
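
    That earlier version is not reproduced here. As a rough sketch of what a mutation-free variant could look like (this is an assumption, not the answer's original code, and `removeDuplicatesKeepLastRec` and `loop` are names invented for the example), the mutable variable can be replaced by a parameter of a local recursive function that still shares a single enumerator:

```fsharp
let removeDuplicatesKeepLastRec equals (items: seq<'a>) =
    seq {
        // Still a single enumerator, so no nested Seq.tail wrappers are created.
        use e = items.GetEnumerator()

        // `previous` is passed along as an argument instead of being mutated.
        let rec loop previous =
            seq {
                if e.MoveNext() then
                    let current = e.Current
                    // A new run starts, so the last element of the old run is final.
                    if not (equals previous current) then
                        yield previous
                    yield! loop current
                else
                    // Input exhausted: emit the last element of the final run.
                    yield previous
            }

        if e.MoveNext() then
            yield! loop e.Current
    }

let testRec = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]

removeDuplicatesKeepLastRec (fun a b -> fst a = fst b) testRec
|> Seq.toList
|> printfn "%A"

// output
// [("a", 1); ("b", 4); ("c", 5)]
```

    The recursion relies on `yield!` appearing in tail position inside the sequence expression, which F# is expected to handle without building up a chain of nested enumerators; it is worth measuring on large inputs before relying on it.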
