F#: removing duplicates from a seq is slow

半阙折子戏 2021-01-12 03:11

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist:

9 answers
  • 2021-01-12 03:51

    This is a bit of an old question, but I am looking for old examples to demonstrate a new library that I have been working on. It is a replacement for System.Linq.Enumerable, and it also has a wrapper to replace F#'s Seq. It is not complete yet, but it is polyfilled to match the existing APIs (i.e. anything not yet implemented just forwards to the existing functionality).

    It is available on NuGet here: https://www.nuget.org/packages/Cistern.Linq.FSharp/

    So I have taken your modified Seq from the bottom of your answer and "converted" it to Cistern.Linq.FSharp (which is just a search and replace of "Seq." for "Linq."), and then compared its runtime to your original. The Cistern version runs in well under 50% of the time (I get ~41%).

    open System
    open Cistern.Linq.FSharp
    open System.Diagnostics
    
    let dedupeTakingLastCistern equalityFn s = 
        s 
        |> Linq.map Some
        |> fun x -> Linq.append x [None]
        |> Linq.pairwise
        |> Linq.map (fun (x,y) -> 
                match (x,y) with 
                | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
                | (_,None) -> x
                | _ -> None )
        |> Linq.choose id
    
    let dedupeTakingLastSeq equalityFn s = 
        s 
        |> Seq.map Some
        |> fun x -> Seq.append x [None]
        |> Seq.pairwise
        |> Seq.map (fun (x,y) -> 
                match (x,y) with 
                | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
                | (_,None) -> x
                | _ -> None )
        |> Seq.choose id
    
    let test data which f =
        let iterations = 1000
    
        let sw = Stopwatch.StartNew ()
        for i = 1 to iterations do
            data
            |> f (fun x y -> x = y)
            |> List.ofSeq    
            |> ignore
        printfn "%s %d" which sw.ElapsedMilliseconds
    
    
    [<EntryPoint>]
    let main argv =
        let data = List.init 10000 (fun _ -> 1)
    
        for i = 1 to 5 do
            test data "Seq" dedupeTakingLastSeq
            test data "Cistern" dedupeTakingLastCistern
    
        0
    
  • 2021-01-12 03:56

    The performance issue comes from the nested calls to Seq.tail. Here is the source code of Seq.tail:

    [<CompiledName("Tail")>]
    let tail (source: seq<'T>) =
        checkNonNull "source" source
        seq { use e = source.GetEnumerator() 
              if not (e.MoveNext()) then 
                  invalidArg "source" (SR.GetString(SR.notEnoughElements))
              while e.MoveNext() do
                  yield e.Current }
    

    If you call Seq.tail(Seq.tail(Seq.tail(...))), the compiler has no way of optimizing away the enumerators created by GetEnumerator(). Every returned element has to bubble up through all the previously created sequences, and each of those sequences has to advance its own internal state as well. The net result is a running time of O(n^2) instead of linear O(n).

    Unfortunately there is currently no way to represent this in a functional style in F#. You can with a list (x::xs) but not for a sequence. Until the language gets better native support for sequences, don't use Seq.tail in recursive functions.
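
    The quadratic cost described above can be observed with a small experiment. This is only a sketch: source, sumWithTail and the produced counter are hypothetical names introduced for illustration, and the exact counts may vary between FSharp.Core versions, so only the order of magnitude matters.

    ```fsharp
    // Counts how many elements the source is asked to produce.
    let produced = ref 0

    // Hypothetical source that bumps the counter for each element it yields.
    let source n =
        seq { for i in 0 .. n - 1 do
                produced.Value <- produced.Value + 1
                yield i }

    // Walks the sequence list-style: every step wraps the input in another Seq.tail.
    let rec sumWithTail (s: seq<int>) acc =
        if Seq.isEmpty s then acc
        else sumWithTail (Seq.tail s) (acc + Seq.head s)

    let n = 100

    produced.Value <- 0
    let nestedSum = sumWithTail (source n) 0
    let nestedCount = produced.Value

    produced.Value <- 0
    let singleSum = Seq.sum (source n)
    let singleCount = produced.Value

    printfn "nested Seq.tail: %d elements produced" nestedCount
    printfn "single pass:     %d elements produced" singleCount
    ```

    Both traversals compute the same sum, but the nested version re-enumerates the source from the start for every Seq.isEmpty and Seq.head call, so its count grows roughly with n^2 while the single pass stays at n.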

    Using a single enumerator will fix the performance problem.

    let RemoveDuplicatesKeepLast equals (items:seq<_>) =
        seq {
            use e = items.GetEnumerator()
    
            if e.MoveNext() then
                let mutable previous = e.Current
    
                while e.MoveNext() do
                    if not (previous |> equals e.Current) then 
                        yield previous
                    previous <- e.Current
    
                yield previous
        }
    
    let test = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    let FirstEqual a b = fst a = fst b
    
    RemoveDuplicatesKeepLast FirstEqual test
    |> printf "%A"
    
    // output
    // seq [("a", 1); ("b", 4); ("c", 5)]
    

    The first version of this answer has a recursive version of the above code without mutation.

  • 2021-01-12 03:56

    The problem is with how you use sequences. All those yields, heads and tails are spinning a web of iterators branching off of iterators, and when you finally materialize it when you call List.ofSeq, you're iterating through your input sequence way more than you should.

    Each of those Seq.heads is not simply taking the first element of a sequence - it's taking the first element of the tail of a sequence of a tail of a sequence of tail of a sequence and so on.

    Check this out - it'll count the times the element constructor is called:

    let count = ref 0
    
    Seq.init 1000 (fun i -> count := !count + 1; 1) 
    |> dedupeTakingLast (fun (x,y) -> x = y) None 
    |> List.ofSeq
    |> ignore
    
    // Show how many times the element constructor ran
    printfn "%d" !count
    

    Incidentally, just switching out all the Seqs to Lists makes it go instantly.
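
    A minimal sketch of what a list-based version could look like, assuming the input fits in memory. dedupeKeepLastList is a hypothetical name and not the code from the question; it keeps the last element of each run of consecutive duplicates.

    ```fsharp
    // Removes consecutive duplicates from a list, keeping the last item of each run.
    let dedupeKeepLastList equals items =
        let rec loop acc remaining =
            match remaining with
            | [] -> List.rev acc
            | [x] -> List.rev (x :: acc)
            | x :: (y :: _ as rest) ->
                // x is dropped when the next element is "equal" to it
                if equals x y then loop acc rest
                else loop (x :: acc) rest
        loop [] items

    let sample = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    let deduped = dedupeKeepLastList (fun a b -> fst a = fst b) sample

    printfn "%A" deduped
    ```

    Pattern matching on x :: xs is constant time per step for lists, so this loop is linear, unlike the Seq.head / Seq.tail version.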

  • 2021-01-12 04:05

    Seq.isEmpty, Seq.head and Seq.tail are slow because each of them creates a new Enumerator instance, which it then calls into. You end up with a lot of GC.

    Generally, sequences are forward-only, and if you use them 'like pattern matching for lists', the performance becomes really shoddy.

    Looking a bit at your code... | None -> yield! s creates a new Enumerator even though we know s is empty. Every recursive call probably ends up creating a new IEnumerable that is then directly turned into an Enumerator from the call-site with yield!.

  • 2021-01-12 04:05

    Here is an implementation using Seq.mapFold. It requires passing in a sentinel value (notLast) that is not equal to the last element, and it eliminates the need to write a recursive function. It should run faster, but I have not tested it.

    let dedupe notLast equalityfn (s:'a seq) =
        [notLast]
        |> Seq.append s
        |>  Seq.mapFold
                (fun prev item  -> 
                    if equalityfn prev item 
                        then (None, item)
                        else (Some(prev), item))
                (Seq.head s)
        |>  fst
        |>  Seq.choose id
    
    let items = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)] 
    
    let unique =     
        dedupe ("", 0) (fun (x1, x2) (y1, y2) -> x1 = y1) items 
    
    printfn "%A" unique
    
  • 2021-01-12 04:08

    I'm also looking forward to a non-seq answer. Here's another solution:

    let t = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    t |> Seq.groupBy fst |> Seq.map (snd >>  Seq.last)
    

    I tested on a 1M list:

    Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
    val it : seq<int * int> = seq [(2, 2); (1, 1)]
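
    One caveat worth checking before relying on this: Seq.groupBy groups all elements with an equal key, not only adjacent ones, so it matches a consecutive dedupe only when equal keys always appear next to each other. A small sketch of the difference, where lastPerKey is a hypothetical helper wrapping the pipeline above:

    ```fsharp
    // Same pipeline as above, wrapped in a hypothetical helper.
    let lastPerKey items =
        items |> Seq.groupBy fst |> Seq.map (snd >> Seq.last) |> List.ofSeq

    // Equal keys are adjacent: same result as a consecutive dedupe that keeps the last item.
    let adjacent = lastPerKey [("a", 1); ("b", 2); ("b", 3); ("c", 4)]

    // Equal keys are not adjacent: both "a" runs are merged into a single group.
    let scattered = lastPerKey [("a", 1); ("b", 2); ("a", 3)]

    printfn "%A" adjacent
    printfn "%A" scattered
    ```

    For the scattered input, a consecutive dedupe would keep all three items, while the groupBy version returns only two.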
    