I'm trying to build a web crawler in Go where I would like to specify the max number of concurrent workers. They will all be working as long as there are links to explore in the queue.
If you use your favourite web search for "Go web crawler" (or "golang web crawler") you'll find many examples including: Go Tour Exercise: Web Crawler. There are also some talks on concurrency in Go that cover this kind of thing.
The "standard" way to do this in Go does not need to involve wait groups at all.
To answer one of your questions: things queued with defer only get run when the function returns. You have a long-running function, so do not use defer in such a loop.
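A tiny illustration of that (a throwaway example, not from the question's code):

    package main

    import "fmt"

    func loop() {
        for i := 0; i < 3; i++ {
            defer fmt.Println("deferred", i) // queued up, NOT run each iteration
            fmt.Println("iteration", i)
        }
        // The deferred calls all run only now, as loop returns, in LIFO order:
        // iteration 0, iteration 1, iteration 2, deferred 2, deferred 1, deferred 0
    }

    func main() { loop() }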
The "standard" way is to start up however many workers you want in their own goroutines. They all read "jobs" from the same channel, blocking if/when there is nothing to do. When fully done that channel is closed and they all exit.
In the case of something like a crawler the workers will discover more "jobs" to do and want to enqueue them. You don't want them writing back to the same channel since it will have some limited amount of buffering (or none!) and you'll eventually block all the workers trying to enqueue more jobs!
A simple solution to this is to use a separate channel (e.g. each worker has in <-chan Job, out chan<- Job) and a single queue/filter goroutine that reads these requests, appends them onto a slice that it either lets grow arbitrarily large or does some global limiting on, and also feeds the other channel from the head of the slice (i.e. a simple for-select loop reading from one channel and writing to the other). This code is also usually responsible for keeping track of what has already been done (e.g. a map of URLs visited) and drops incoming requests for duplicates.
The queue goroutine might look something like this (argument names excessively verbose here):
type Job string

func queue(toWorkers chan<- Job, fromWorkers <-chan Job) {
    var list []Job
    done := make(map[Job]bool)
    for {
        var send chan<- Job
        var item Job
        if len(list) > 0 {
            send = toWorkers
            item = list[0]
        }
        // If the list is empty, send stays nil and the first case
        // below can never fire, so the select blocks until a worker
        // sends us something.
        select {
        case send <- item:
            // We sent an item, remove it
            list = list[1:]
        case thing := <-fromWorkers:
            // Got a new thing; enqueue it only if we haven't seen it before
            if !done[thing] {
                list = append(list, thing)
                done[thing] = true
            }
        }
    }
}
A few things are glossed over in this simple example, such as termination. And if "Jobs" is some larger structure where you'd want to use chan *Job and []*Job instead, then you'd also need to change the map type to some key you extract from the job (e.g. Job.URL perhaps), and you'd want to do list[0] = nil before list = list[1:] to get rid of the reference to the *Job pointer and let the garbage collector at it earlier.
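For that pointer variant, the select in the queue above might become something like this (just a sketch; the URL field as map key is the assumption mentioned above):

    select {
    case send <- item: // item is a *Job here
        // Clear the slot before reslicing so the slice's backing
        // array no longer pins the sent *Job; the GC can then
        // collect the Job once the worker is done with it.
        list[0] = nil
        list = list[1:]
    case thing := <-fromWorkers:
        if !done[thing.URL] { // map[string]bool keyed by URL now
            list = append(list, thing)
            done[thing.URL] = true
        }
    }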
Edit: Some notes on terminating cleanly.
There are several ways to terminate code like the above cleanly. A wait group could be used, but the placement of the Add/Done calls needs to be done carefully and you'd probably need another goroutine to do the Wait (and then close one of the channels to start the shutdown). The workers shouldn't close their output channel since there are multiple workers and you can't close a channel more than once; and the queue goroutine can't tell when to close its channel to the workers without knowing when the workers are done.
In the past when I've used code very similar to the above, I used a local "outstanding" counter within the "queue" goroutine (which avoids any need for a mutex or any synchronization overhead that a wait group has). The count of outstanding jobs is increased when a job is sent to a worker. It's decreased again when the worker says it's finished with it. My code happened to have another channel for this (my "queue" was also collecting results in addition to further nodes to enqueue). It's probably cleaner on its own channel, but instead a special value on the existing channel (e.g. a nil Job pointer) could be used. At any rate, with such a counter, the existing length check on the local list just needs to see that there is nothing outstanding when the list is empty, and then it's time to terminate: just close the channel to the workers and return.
E.g.:
    if len(list) > 0 {
        send = toWorkers
        item = list[0]
    } else if outstandingJobs == 0 {
        close(toWorkers)
        return
    }
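Putting that counter together with the queue above, one possible complete sketch (the separate workerDone channel and the seed argument are my additions here; as noted, a special value on the existing channel would do as well):

    func queue(toWorkers chan<- Job, fromWorkers <-chan Job, workerDone <-chan struct{}, seed Job) {
        list := []Job{seed}
        done := map[Job]bool{seed: true}
        outstandingJobs := 0
        for {
            var send chan<- Job
            var item Job
            if len(list) > 0 {
                send = toWorkers
                item = list[0]
            } else if outstandingJobs == 0 {
                // Nothing queued and no worker busy: all done.
                close(toWorkers)
                return
            }
            select {
            case send <- item:
                list = list[1:]
                outstandingJobs++
            case thing := <-fromWorkers:
                if !done[thing] {
                    list = append(list, thing)
                    done[thing] = true
                }
            case <-workerDone:
                // A worker finished a job. Workers must send any newly
                // discovered jobs on fromWorkers before signalling here,
                // or the counter could hit zero with jobs still in flight.
                outstandingJobs--
            }
        }
    }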
I wrote a solution using Go's mutual exclusion lock (sync.Mutex).
When it runs concurrently, it is important that only one goroutine accesses the url map at a time; I believe I implemented that as written below. Please feel free to try this out. I would appreciate your feedback, as I will learn from your comments as well.
package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

// SafeUrlBook restricts access so that only one goroutine can touch the
// central url map at a time, so no redundant crawling should occur.
type SafeUrlBook struct {
    book map[string]bool
    mux  sync.Mutex
}

// doesThisExist reports whether url was already seen, and marks it as
// seen if it wasn't.
func (sub *SafeUrlBook) doesThisExist(url string) bool {
    sub.mux.Lock()
    defer sub.mux.Unlock()
    if sub.book[url] {
        return true
    }
    sub.book[url] = true
    return false
}

// End SafeUrlBook

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
// Note that safeBook (SafeUrlBook) keeps track of which urls have been
// visited. It is passed as a pointer: copying a SafeUrlBook by value
// would copy its mutex (go vet flags this), and each copy would lock
// independently, defeating the mutual exclusion.
func Crawl(url string, depth int, fetcher Fetcher, safeBook *SafeUrlBook) {
    if depth <= 0 {
        return
    }
    if safeBook.doesThisExist(url) {
        fmt.Println("Skip", url)
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        Crawl(u, depth-1, fetcher, safeBook)
    }
}

func main() {
    safeBook := &SafeUrlBook{book: make(map[string]bool)}
    Crawl("https://golang.org/", 4, fetcher, safeBook)
}
// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "https://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "https://golang.org/pkg/",
            "https://golang.org/cmd/",
        },
    },
    "https://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "https://golang.org/",
            "https://golang.org/cmd/",
            "https://golang.org/pkg/fmt/",
            "https://golang.org/pkg/os/",
        },
    },
    "https://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
    "https://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
}
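One caveat: as written, Crawl explores pages sequentially, so the mutex is never actually contended. A sketch of a concurrent variant using the same types, with a sync.WaitGroup to wait for all goroutines (note this spawns one goroutine per link rather than enforcing the bounded worker count from the original question):

    func CrawlConcurrent(url string, depth int, fetcher Fetcher, safeBook *SafeUrlBook, wg *sync.WaitGroup) {
        defer wg.Done()
        if depth <= 0 || safeBook.doesThisExist(url) {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        for _, u := range urls {
            wg.Add(1) // account for each child before launching it
            go CrawlConcurrent(u, depth-1, fetcher, safeBook, wg)
        }
    }

    // Usage, replacing the body of main above:
    //   safeBook := &SafeUrlBook{book: make(map[string]bool)}
    //   var wg sync.WaitGroup
    //   wg.Add(1)
    //   CrawlConcurrent("https://golang.org/", 4, fetcher, safeBook, &wg)
    //   wg.Wait()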