I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have any thoughts on a good way to approach this?
If your NPO's sites are relatively big or complex (with dynamic pages that can effectively create a 'black hole', like a calendar with a 'next day' link), you'd be better off using a real web crawler, like Heritrix.
If the sites total only a few pages, you can get away with just using curl or wget or your own script (a wget example follows below). Just remember that if they start to get big, or you start making your script more complex, to switch to a real crawler, or at least look at its source to see what it is doing and why.
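For instance, a single wget invocation can recursively fetch a small site and log everything it touches; the domain here is just a placeholder, and the depth and delay values are arbitrary:

    wget --recursive --level=3 --wait=1 --no-parent -o crawl.log https://www.example.org/

The resulting crawl.log lists every URL that was fetched, which may already be the "list of findings" you need.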
The complicated part of a crawler is scaling it to a huge number of websites/requests. In that situation you will have to deal with issues like these (and there are more):
The impossibility of keeping all the info in one database.
Not enough RAM to deal with huge indexes.
Multithreaded performance and concurrency.
Crawler traps (infinite loops created by changing URLs, calendars, session IDs...) and duplicated content (see the sketch after this list).
Crawling from more than one computer.
Malformed HTML.
Constant HTTP errors from servers.
Databases without compression, which make your storage needs about 8x bigger.
Recrawl routines and priorities.
Using requests with compression (deflate/gzip), which is good for any kind of crawler.
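As a concrete illustration of the trap and duplicate-content problem, here is a small sketch (Python 3 standard library only; the parameter names and cut-offs are illustrative, not anything prescribed above) that normalises URLs and caps crawl depth so changing query strings and endless "next day" links stop generating new work:

    # Normalise URLs and cap depth to avoid crawler traps and duplicates.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source"}  # assumed junk params
    MAX_DEPTH = 5  # arbitrary cut-off against calendar-style infinite links

    def normalize(url: str) -> str:
        """Return a canonical form of the URL for duplicate detection."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in TRACKING_PARAMS]
        # Lower-case scheme/host, drop the fragment, strip junk query params.
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", urlencode(query), ""))

    def should_visit(url: str, depth: int, seen: set) -> bool:
        canonical = normalize(url)
        if depth > MAX_DEPTH or canonical in seen:
            return False
        seen.add(canonical)
        return True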
And some very important things (illustrated in the sketch after this list):
Respect robots.txt.
Add a crawl delay between requests so you don't suffocate the web servers.
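A minimal sketch of those politeness points, assuming Python 3 plus the requests package; the user-agent string, site, and paths are placeholders:

    import time
    import urllib.robotparser
    import requests

    USER_AGENT = "npo-simple-crawler/0.1"   # hypothetical user-agent
    SITE = "https://www.example.org"        # placeholder site

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(SITE + "/robots.txt")
    robots.read()

    delay = robots.crawl_delay(USER_AGENT) or 1.0  # fall back to a 1 s delay

    for path in ["/", "/about", "/events"]:        # placeholder pages
        url = SITE + path
        if not robots.can_fetch(USER_AGENT, url):
            continue                               # respect robots.txt
        resp = requests.get(url, headers={
            "User-Agent": USER_AGENT,
            "Accept-Encoding": "gzip, deflate",    # compressed responses save bandwidth
        }, timeout=10)
        print(url, resp.status_code, len(resp.text))
        time.sleep(delay)                          # don't suffocate the server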
Wikipedia has a good article about web crawlers, covering many of the algorithms and considerations.
However, I wouldn't bother writing my own crawler. It's a lot of work, and since you only need a "simple crawler", I'm thinking all you really need is an off-the-shelf crawler. There are a lot of free and open-source crawlers that will likely do everything you need, with very little work on your part.
You'll be reinventing the wheel, to be sure. But here are the basics: a list of unvisited URLs (seeded with one or more starting pages), a list of visited URLs (so you don't go around in circles), and a set of rules for which URLs you're interested in.
Put these in persistent storage, so you can stop and start the crawler without losing state.
Algorithm is:
    while (list of unvisited URLs is not empty) {
        take URL from list
        remove it from the unvisited list and add it to the visited list
        fetch content
        record whatever it is you want to about the content
        if content is HTML {
            parse out URLs from links
            foreach URL {
                if it matches your rules
                and it's not already in either the visited or unvisited list
                    add it to the unvisited list
            }
        }
    }
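Here is one way that loop might look as runnable Python 3, just as a sketch: it assumes the requests package, and the seed URL, the "same host only" rule, and the state file names are placeholders rather than part of the algorithm above.

    import json
    import time
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlsplit

    import requests

    SEED = "https://www.example.org/"   # placeholder starting page
    HOST = urlsplit(SEED).netloc        # rule: only follow links on the same host

    class LinkParser(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def load(name, default):
        try:
            with open(name) as f:
                return json.load(f)
        except FileNotFoundError:
            return default

    def save(name, data):
        with open(name, "w") as f:
            json.dump(data, f)

    # Persistent state, so the crawler can be stopped and restarted.
    unvisited = load("unvisited.json", [SEED])
    visited = set(load("visited.json", []))

    while unvisited:
        url = unvisited.pop()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        # "Record whatever you want about the content" -- here, just print it.
        print(url, resp.status_code, resp.headers.get("Content-Type", ""))

        if "text/html" in resp.headers.get("Content-Type", ""):
            parser = LinkParser()
            parser.feed(resp.text)
            for link in parser.links:
                absolute = urljoin(url, link)
                # The "rules": same host only, and not already queued or visited.
                if urlsplit(absolute).netloc == HOST \
                        and absolute not in visited and absolute not in unvisited:
                    unvisited.append(absolute)

        # Save state after every page and be polite to the server.
        save("unvisited.json", unvisited)
        save("visited.json", sorted(visited))
        time.sleep(1)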