“Throttled” async download in F#

前端未结


I'm trying to download the 3000+ photos referenced from the XML backup of my blog. The problem I came across is that if just one of those photos is no longer available, the


4 Answers

时光说笑

2020-11-30 06:45

Nothing is ever easy. :)

I think the issues you're hitting are intrinsic to the problem domain (as opposed to merely being issues with the async programming model, though they do interact somewhat).

Say you want to download 3000 pictures. First, your .NET process limits the number of simultaneous HTTP connections it will open to a single host. The setting is System.Net.ServicePointManager.DefaultConnectionLimit, and I think the default for client applications is just 2. So you could set it to a higher number, and it would help.
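As far as I know, the control alluded to here is System.Net.ServicePointManager.DefaultConnectionLimit. A minimal sketch of raising it follows; the value 16 is an arbitrary assumption, and the setting mainly governs HttpWebRequest/WebClient on .NET Framework (HttpClient on newer runtimes may ignore it):

```fsharp
open System.Net

// Assumption: 16 is an illustrative value, not a recommendation.
// Set this once, before the first request is issued.
ServicePointManager.DefaultConnectionLimit <- 16

printfn "Connection limit per host: %d" ServicePointManager.DefaultConnectionLimit
```

Raising the limit only removes the client-side cap; bandwidth and server-side limits still apply.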

But then next, your machine and internet connection have finite bandwidth. So even if you could try to concurrently start 3000 HTTP connections, each individual connection would get slower based on the bandwidth pipe limitations. So this would also interact with timeouts. (And this doesn't even consider what kinds of throttles/limits are on the server. Maybe if you send 3000 requests it will think you are DoS attacking and blacklist your IP.)

So this is really a problem domain where a good solution requires some intelligent throttling and flow-control in order to manage how the underlying system resources are used.

As in the other answer, F# agents (MailboxProcessors) are a good programming model for authoring such throttling/flow-control logic.

(Even with all that, if most picture files are like 1MB but then there is a 1GB file mixed in there, that single file might trip a timeout.)
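One way to keep a single oversized download from stalling the rest is to give each download its own timeout and decide afterwards what to do with the ones that ran out of time. A minimal sketch using Async.StartChild; the helper name withTimeout and the durations are assumptions for illustration only:

```fsharp
open System

/// Hypothetical helper: runs 'work', returning None if it takes longer than 'ms'.
let withTimeout (ms: int) (work: Async<'a>) : Async<'a option> =
    async {
        try
            // The child raises TimeoutException when awaited if it overruns 'ms'
            let! child = Async.StartChild(work, ms)
            let! result = child
            return Some result
        with :? TimeoutException ->
            return None
    }

// Stand-ins for a small and a very large download
let quick = async { return "small file" }
let slow =
    async {
        do! Async.Sleep 5000
        return "huge file" }

let quickResult = withTimeout 1000 quick |> Async.RunSynchronously
let slowResult = withTimeout 100 slow |> Async.RunSynchronously
```

Items that come back as None could be retried later with a longer timeout instead of failing the whole batch.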

Anyway, this is not so much an answer to the question, as just pointing out how much intrinsic complexity there is in the problem domain itself. (Perhaps it's also suggestive of why UI 'download managers' are so popular.)


花落未央

2020-11-30 06:47

I think there must be a better way to find out that a file is not available than using a timeout. I'm not exactly sure, but is there some way to make it throw an exception if a file cannot be found? Then you could just wrap your async code inside try .. with and you should avoid most of the problems.
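A minimal sketch of that idea, assuming HttpClient is used for the downloads: EnsureSuccessStatusCode throws for a 404 (or any other non-success status), and a try .. with wrapper converts the failure into a value. The names httpFetch and tryFetch are hypothetical helpers, not part of any library:

```fsharp
open System.Net.Http

// Shared client; assumption: default settings are acceptable here.
let client = new HttpClient()

/// Downloads 'url'. EnsureSuccessStatusCode throws for 404 and other
/// non-success responses instead of waiting for a timeout.
let httpFetch (url: string) : Async<byte[]> =
    async {
        let! response = client.GetAsync(url) |> Async.AwaitTask
        response.EnsureSuccessStatusCode() |> ignore
        return! response.Content.ReadAsByteArrayAsync() |> Async.AwaitTask
    }

/// Hypothetical wrapper: turns any failure into an Error value so that one
/// missing photo does not abort the whole batch.
let tryFetch (fetch: string -> Async<byte[]>) (url: string) : Async<Result<byte[], string>> =
    async {
        try
            let! bytes = fetch url
            return Ok bytes
        with ex ->
            return Error (sprintf "%s: %s" url ex.Message)
    }
```

Calling tryFetch httpFetch on each photo URL then yields either the bytes or an error message per photo, which can be logged and skipped.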

Anyway, if you want to write your own "concurrency manager" that runs a certain number of requests in parallel and queues the remaining pending requests, then the easiest option in F# is to use agents (the MailboxProcessor type). The following object encapsulates the behavior:

type ThrottlingAgentMessage = 
  | Completed
  | Work of Async<unit>

/// Represents an agent that runs operations concurrently. When the number
/// of concurrent operations exceeds 'limit', they are queued and processed later
type ThrottlingAgent(limit) = 
  let agent = MailboxProcessor.Start(fun agent -> 
    /// Represents a state when the agent is blocked
    let rec waiting () = 
      // Use 'Scan' to wait for completion of some work
      agent.Scan(function
        | Completed -> Some(working (limit - 1))
        | _ -> None)
    /// Represents a state when the agent is working
    and working count = async { 
      // Receive any message 
      let! msg = agent.Receive()
      match msg with 
      | Completed -> 
          // Decrement the counter of work items
          return! working (count - 1)
      | Work work ->
          // Start the work item & continue in blocked/working state
          async { try do! work 
                  finally agent.Post(Completed) }
          |> Async.Start
          // 'count + 1' items are now running; block once the limit is reached
          if count + 1 < limit then return! working (count + 1)
          else return! waiting () }
    working 0)      

  /// Queue the specified asynchronous workflow for processing
  member x.DoWork(work) = agent.Post(Work work)


刺人心

2020-11-30 06:47

Here's a variation on Tomas's answer, because I needed an agent which could return results.

type ThrottleMessage<'a> = 
    | AddJob of (Async<'a>*AsyncReplyChannel<'a>) 
    | DoneJob of ('a*AsyncReplyChannel<'a>) 
    | Stop

/// This agent accumulates 'jobs' but limits the number which run concurrently.
type ThrottleAgent<'a>(limit) =    
    let agent = MailboxProcessor<ThrottleMessage<'a>>.Start(fun inbox ->
        let rec loop(jobs, count) = async {
            let! msg = inbox.Receive()  //get next message
            match msg with
            | AddJob(job) -> 
                if count < limit then   //if not at limit, we work, else loop
                    return! work(job::jobs, count)
                else
                    return! loop(job::jobs, count)
            | DoneJob(result, reply) -> 
                reply.Reply(result)           //send back result to caller
                return! work(jobs, count - 1) //no need to check limit here
            | Stop -> return () }
        and work(jobs, count) = async {
            match jobs with
            | [] -> return! loop(jobs, count) //if no jobs left, wait for more
            | (job, reply)::jobs ->          //run job, post Done when finished
                async { let! result = job 
                        inbox.Post(DoneJob(result, reply)) }
                |> Async.Start
                return! loop(jobs, count + 1) //job started, go back to waiting
        }
        loop([], 0)
    )
    member m.AddJob(job) = agent.PostAndAsyncReply(fun rep -> AddJob(job, rep))
    member m.Stop() = agent.Post(Stop)

In my particular case, I just need to use it as a 'one shot' 'map', so I added a static function:

    static member RunJobs limit jobs = 
        let agent = ThrottleAgent<'a>(limit)
        let res = jobs |> Seq.map (fun job -> agent.AddJob(job))
                       |> Async.Parallel
                       |> Async.RunSynchronously
        agent.Stop()
        res

It seems to work ok...


悲&欢浪女

2020-11-30 07:00

Here's an out of the box solution:

FSharpx.Control offers an Async.ParallelWithThrottle function. I'm not sure it is the best implementation, since it uses SemaphoreSlim, but it is very easy to use, and because my application doesn't need top performance it works well enough for me. Since it is a library, if someone knows how to make it faster, improving it there would benefit everyone who just wants working code to get their work done.
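For readers curious what a SemaphoreSlim-based throttle looks like, here is a minimal sketch. It is not the FSharpx implementation; the function name parallelWithThrottle and its signature are assumptions made for illustration:

```fsharp
open System.Threading

/// Hypothetical helper: runs the jobs in parallel, at most 'limit' at a time.
let parallelWithThrottle (limit: int) (jobs: seq<Async<'a>>) : Async<'a[]> =
    async {
        // The semaphore hands out 'limit' permits; each job holds one while it runs
        use gate = new SemaphoreSlim(limit, limit)
        let throttled (job: Async<'a>) =
            async {
                do! gate.WaitAsync() |> Async.AwaitTask
                try
                    return! job
                finally
                    // Always give the permit back, even if the job fails
                    gate.Release() |> ignore }
        return! jobs |> Seq.map throttled |> Async.Parallel
    }

// Usage: ten stand-in jobs, at most three running at once
let squares =
    [ for i in 0 .. 9 -> async {
        do! Async.Sleep 20
        return i * i } ]
    |> parallelWithThrottle 3
    |> Async.RunSynchronously
```

On FSharp.Core 4.7 or later, Async.Parallel also accepts an optional maxDegreeOfParallelism argument, which may be enough on its own for simple cases.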
