F#: removing duplicates from a seq is slow

半阙折子戏 2021-01-12 03:11

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist:

9 answers
  • 2021-01-12 04:09

    Here is a pretty fast approach which uses library functions rather than Seq expressions.

    Your test runs in 0.007 seconds on my PC.

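    It relies on a pretty nasty hack for the first element: you have to pass in a dummy `prev` value that matches nothing in the input, and that dummy value ends up as the first element of the output. This could be improved.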

    let rec dedupe equalityfn prev (s:'a seq) : 'a seq =
        if Seq.isEmpty s then
            Seq.empty
        else
            let rest = Seq.skipWhile (equalityfn prev) s
            let valid = Seq.takeWhile (equalityfn prev) s
            let valid2 = if Seq.isEmpty valid  then Seq.singleton prev else (Seq.last valid) |> Seq.singleton
            let filtered = if Seq.isEmpty rest then Seq.empty else dedupe equalityfn (Seq.head rest) (rest)
            Seq.append valid2 filtered
    
    let t = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
            |> dedupe (fun (x1, y1) (x2, y2) -> x1=x2) ("asdfasdf",1)
            |> List.ofSeq;;
    
    #time
    List.init 1000 (fun _ -> 1)
    |> dedupe (fun x y -> x = y) (189234784)
    |> List.ofSeq
    #time;;
    --> Timing now on
    
    Real: 00:00:00.007, CPU: 00:00:00.006, GC gen0: 0, gen1: 0
    val it : int list = [189234784; 1]
    
    --> Timing now off
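
One possible way to avoid the dummy first element is to seed `prev` with the head of the input, which is what the recursive call already does for every later run. The sketch below repeats the answer's `dedupe` so that it is self-contained; the wrapper name `dedupeFromHead` is hypothetical and not part of the original answer. It assumes the equality function returns true when an element is compared with itself, and, like the original, it enumerates the input several times.

```fsharp
// Same function as in the answer above, repeated so the sketch compiles on its own.
let rec dedupe equalityfn prev (s: 'a seq) : 'a seq =
    if Seq.isEmpty s then
        Seq.empty
    else
        let rest = Seq.skipWhile (equalityfn prev) s
        let valid = Seq.takeWhile (equalityfn prev) s
        let valid2 = if Seq.isEmpty valid then Seq.singleton prev else Seq.last valid |> Seq.singleton
        let filtered = if Seq.isEmpty rest then Seq.empty else dedupe equalityfn (Seq.head rest) rest
        Seq.append valid2 filtered

// Hypothetical wrapper (an assumption, not the answerer's code):
// use the first element as the initial `prev`, so no dummy value is needed.
let dedupeFromHead equalityfn (s: 'a seq) : 'a seq =
    if Seq.isEmpty s then Seq.empty
    else dedupe equalityfn (Seq.head s) s

let pairs =
    [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    |> dedupeFromHead (fun (x1, _) (x2, _) -> x1 = x2)
    |> List.ofSeq
// pairs = [("a", 1); ("b", 4); ("c", 5)]

let ones =
    List.init 1000 (fun _ -> 1)
    |> dedupeFromHead (fun x y -> x = y)
    |> List.ofSeq
// ones = [1]
```

Because the first run now starts from a real element, the output contains only values taken from the input.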
    
  • 2021-01-12 04:10

    As the other answers have said, seqs are really slow. However, the real question is: why do you want to use a seq here? In particular, you start with a list, you traverse the entire list, and you create a new list at the end. There doesn't appear to be any reason to use a sequence at all unless you need sequence-specific features. In fact, the docs state (emphasis mine):

    A sequence is a logical series of elements all of one type. Sequences are particularly useful when you have a large, ordered collection of data but do not necessarily expect to use all the elements. Individual sequence elements are computed only as required, so a sequence can provide better performance than a list in situations in which not all the elements are used.
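
To illustrate the "computed only as required" part of that quote, here is a small sketch. The names `computed`, `squares`, and `firstThree` are made up for this example. It builds an infinite sequence and takes only three elements, which is something a list could not do.

```fsharp
// Counts how many times the generator function actually runs.
let computed = ref 0

// An infinite sequence of squares; nothing is computed at this point.
let squares =
    Seq.initInfinite (fun i ->
        computed.Value <- computed.Value + 1
        i * i)

// Only the elements that are requested get evaluated.
let firstThree = squares |> Seq.take 3 |> List.ofSeq
// firstThree = [0; 1; 4]

printfn "generator calls: %d" computed.Value
```

When every element is consumed anyway, as in the question, this laziness buys nothing and only adds overhead.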

  • 2021-01-12 04:10

    To make efficient use of the input type Seq, one should iterate through each element only once and avoid creating additional sequences.

    On the other hand, to make efficient use of the output type List, one should make liberal use of the cons and tail functions, both of which are basically free.

    Combining the two requirements leads me to this solution:

    // dedupeTakingLast2 : ('a -> 'a -> bool) -> seq<'a> -> 'a list
    let dedupeTakingLast2 equalityFn = 
      Seq.fold 
      <| fun deduped elem ->     
           match deduped with
           | [] -> [ elem ]
           | x :: xs -> if equalityFn x elem 
                          then elem :: xs
                          else elem :: deduped
      <| []
    

    Note, however, that the output list will be in reverse order because each element is prepended. I hope this isn't a dealbreaker, since restoring the original order with List.rev needs an extra pass over the whole list.

    Test:

    List.init 1000 (id) 
    |> dedupeTakingLast2 (fun x y -> x - (x % 10) = y - (y % 10))
    |> List.iter (printfn "%i ")
    
    // 999 989 979 969 etc...
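
If the original order does matter, one option is to accept the extra pass and reverse the result at the end. The following is a minimal sketch of that variant; the names `dedupeTakingLastInOrder` and `step` are hypothetical and not from the answer above.

```fsharp
// Hypothetical variant (an assumption, not the answerer's code):
// same fold as above, followed by List.rev to restore input order.
let dedupeTakingLastInOrder (equalityFn: 'a -> 'a -> bool) (source: seq<'a>) : 'a list =
    // Replace the head while the current run continues; otherwise start a new run.
    let step deduped elem =
        match deduped with
        | x :: xs when equalityFn x elem -> elem :: xs
        | _ -> elem :: deduped
    source |> Seq.fold step [] |> List.rev

let inOrder =
    [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
    |> dedupeTakingLastInOrder (fun (x1, _) (x2, _) -> x1 = x2)
// inOrder = [("a", 1); ("b", 4); ("c", 5)]
```

The input is still traversed only once; the reversal touches only the deduplicated list, which is never longer than the input.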
    