Webcrawler in Go

Backend · Open · 2 answers · 452 views
梦如初夏 · 2021-01-06 12:48

I'm trying to build a web crawler in Go where I would like to specify the max number of concurrent workers. They will all be working as long as there are links to explore.

2 Answers
  •  攒了一身酷
    2021-01-06 13:36

    If you use your favourite web search for "Go web crawler" (or "golang web crawler") you'll find many examples, including the Go Tour exercise "Web Crawler". There are also several talks on concurrency in Go that cover this kind of problem.

    The "standard" way to do this in Go does not need to involve wait groups at all. To answer one of your questions, things queued with defer only get run when the function returns. You have a long-running function, so do not use defer in such a loop.

    The "standard" way is to start up however many workers you want in their own goroutines. They all read "jobs" from the same channel, blocking if/when there is nothing to do. When fully done that channel is closed and they all exit.

    In the case of something like a crawler the workers will discover more "jobs" to do and want to enqueue them. You don't want them writing back to the same channel since it will have some limited amount of buffering (or none!) and you'll eventually block all the workers trying to enqueue more jobs!

    A simple solution to this is to use a separate channel (e.g. each worker has in <-chan Job, out chan<- Job) and a single queue/filter goroutine that reads these requests, appends them to a slice (which it either lets grow arbitrarily large or applies some global limit to), and also feeds the other channel from the head of the slice (i.e. a simple for-select loop reading from one channel and writing to the other). This goroutine is also usually responsible for keeping track of what has already been done (e.g. a map of URLs visited) and drops incoming requests that are duplicates.

    The queue goroutine might look something like this (argument names excessively verbose here):

    type Job string
    
    func queue(toWorkers chan<- Job, fromWorkers <-chan Job) {
        var list []Job
        done := make(map[Job]bool)
        for {
            // Sending on a nil channel blocks forever, so leaving send as
            // nil when the list is empty disables the first select case.
            var send chan<- Job
            var item Job
            if len(list) > 0 {
                send = toWorkers
                item = list[0]
            }
            select {
            case send <- item:
                // We sent an item, remove it
                list = list[1:]
            case thing := <-fromWorkers:
                // Got a new thing
                if !done[thing] {
                    list = append(list, thing)
                    done[thing] = true
                }
            }
        }
    }
    

    A few things are glossed over in this simple example, such as termination, and the case where Job is some larger structure where you'd want to use chan *Job and []*Job instead. In that case you'd also need to change the map type to some key you extract from the job (e.g. Job.URL perhaps), and you'd want to do list[0] = nil before list = list[1:] to drop the reference to the *Job pointer and let the garbage collector at it earlier.

    Edit: Some notes on terminating cleanly.

    There are several ways to terminate code like the above cleanly. A wait group could be used, but the placement of the Add/Done calls needs to be done carefully and you'd probably need another goroutine to do the Wait (and then close one of the channels to start the shutdown). The workers shouldn't close their output channel, since there are multiple workers and you can't close a channel more than once; and the queue goroutine can't tell when to close its channel to the workers without knowing when the workers are done.

    In the past when I've used code very similar to the above, I used a local "outstanding" counter within the "queue" goroutine (which avoids any need for a mutex or the synchronization overhead that a wait group has). The count of outstanding jobs is increased when a job is sent to a worker and decreased again when the worker says it's finished with it. My code happened to have another channel for this (my "queue" was also collecting results in addition to further nodes to enqueue). It's probably cleaner on its own channel, but a special value on the existing channel (e.g. a nil Job pointer) could be used instead. At any rate, with such a counter, the existing length check on the local list just needs to see that there is nothing outstanding when the list is empty; then it's time to terminate: just close the channel to the workers and return.

    E.g.:

        if len(list) > 0 {
            send = toWorkers
            item = list[0]
        } else if outstandingJobs == 0 {
            close(toWorkers)
            return
        }
    
