How to write a crawler?

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have an

10 answers

无人及你

2020-12-02 04:18

I'm using Open Search Server for my company's internal search. Try it at http://open-search-server.com (it's also open source).

悲&欢浪女

2020-12-02 04:21

I wrote a simple web crawler using Reactive Extensions in .NET.

https://github.com/Misterhex/WebCrawler

public class Crawler
{
    class ReceivingCrawledUri : ObservableBase<Uri>
    {
        public int _numberOfLinksLeft = 1; // starts at 1 for the root crawl, so it reaches 0 only once every crawl has finished

        private ReplaySubject<Uri> _subject = new ReplaySubject<Uri>();
        private Uri _rootUri;
        private IEnumerable<IUriFilter> _filters;

        public ReceivingCrawledUri(Uri uri)
            : this(uri, Enumerable.Empty<IUriFilter>().ToArray())
        { }

        public ReceivingCrawledUri(Uri uri, params IUriFilter[] filters)
        {
            _filters = filters;

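            _ = CrawlAsync(uri); // an async method returns an already-running task; calling Start() on it throws InvalidOperationException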
        }

        protected override IDisposable SubscribeCore(IObserver<Uri> observer)
        {
            return _subject.Subscribe(observer);
        }

        private async Task CrawlAsync(Uri uri)
        {
            using (HttpClient client = new HttpClient() { Timeout = TimeSpan.FromMinutes(1) })
            {
                IEnumerable<Uri> result = new List<Uri>();

                try
                {
                    string html = await client.GetStringAsync(uri);
                    result = CQ.Create(html)["a"].Select(i => i.Attributes["href"]).SafeSelect(i => new Uri(i));
                    result = Filter(result, _filters.ToArray());

                    result.ToList().ForEach(async i =>
                    {
                        Interlocked.Increment(ref _numberOfLinksLeft);
                        _subject.OnNext(i);
                        await CrawlAsync(i);
                    });
                }
                catch
                {
                    // Best effort: skip pages that fail to download or parse.
                }

                if (Interlocked.Decrement(ref _numberOfLinksLeft) == 0)
                    _subject.OnCompleted();
            }
        }

        private static List<Uri> Filter(IEnumerable<Uri> uris, params IUriFilter[] filters)
        {
            var filtered = uris.ToList();
            foreach (var filter in filters.ToList())
            {
                filtered = filter.Filter(filtered);
            }
            return filtered;
        }
    }

    public IObservable<Uri> Crawl(Uri uri)
    {
        return new ReceivingCrawledUri(uri, new ExcludeRootUriFilter(uri), new ExternalUriFilter(uri), new AlreadyVisitedUriFilter());
    }

    public IObservable<Uri> Crawl(Uri uri, params IUriFilter[] filters)
    {
        return new ReceivingCrawledUri(uri, filters);
    }
}

and you can use it as follows:

Crawler crawler = new Crawler();
IObservable<Uri> observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine, 
onCompleted: () => Console.WriteLine("Crawling completed"));

夕颜

2020-12-02 04:24

You could make a list of words and start a thread for each word searched on Google.
Each thread then creates a new thread for each link it finds in the page.
Each thread should write what it finds to a database, and it terminates when it finishes reading its page.
And there you have a very big database of links.

南方客

2020-12-02 04:28

Crawlers are simple in concept.

You get a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they've been parsed already (so you need a global record of pages you have already parsed).
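A minimal sketch of that loop, assuming Java 11+. The names `SimpleCrawler`, `crawl`, `fetch`, and `maxPages` are made up for illustration, and the regex link extraction is a deliberate simplification (it only sees double-quoted `href` attributes); a real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch, not a production crawler.
class SimpleCrawler {
    // Simplification: only matches double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Crawls breadth-first from root and returns the URLs in the order they were fetched.
     * fetch maps a URL to its HTML, or to null when the page could not be retrieved.
     */
    static Set<String> crawl(String root, Function<String, String> fetch, int maxPages) {
        Set<String> seen = new HashSet<>();           // global record of URLs already queued
        Set<String> fetched = new LinkedHashSet<>();  // keeps fetch order for the report
        Queue<String> queue = new ArrayDeque<>();
        queue.add(root);
        seen.add(root);

        while (!queue.isEmpty() && fetched.size() < maxPages) {
            String url = queue.remove();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // fetch failed, move on to the next URL
            }
            fetched.add(url);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link;
                try {
                    // Resolve relative links against the page they were found on.
                    link = URI.create(url).resolve(m.group(1).trim()).toString();
                } catch (IllegalArgumentException e) {
                    continue; // skip malformed links
                }
                int hash = link.indexOf('#');
                if (hash >= 0) {
                    link = link.substring(0, hash); // drop the fragment
                }
                if (!link.startsWith("http")) {
                    continue; // ignore mailto:, javascript:, and similar schemes
                }
                if (seen.add(link)) { // add() returns false for URLs seen before
                    queue.add(link);
                }
            }
        }
        return fetched;
    }
}
```

Passing the fetcher in as a function keeps the queue logic separate from the HTTP code, so it can be tried against an in-memory map of pages before pointing it at a live site.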

You can use the Content-Type header to find out what type of content a URL returns, and limit your crawler to parsing only the HTML types.
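As a rough illustration, assuming Java 11+ and its built-in `java.net.http` client: the class and method names below (`ContentTypeCheck`, `isHtml`, `looksLikeHtml`) are hypothetical, and treating only `text/html` and `application/xhtml+xml` as HTML is my assumption about which types are worth parsing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical helper; a sketch of filtering by Content-Type.
class ContentTypeCheck {
    /** True when a Content-Type header value names an HTML media type. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: treat as not HTML
        }
        // Drop parameters such as "; charset=utf-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request and reports whether the server labels the resource as HTML. */
    static boolean looksLikeHtml(HttpClient client, URI uri)
            throws IOException, InterruptedException {
        HttpRequest head = HttpRequest.newBuilder(uri)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(head, HttpResponse.BodyHandlers.discarding());
        return isHtml(response.headers().firstValue("Content-Type").orElse(null));
    }
}
```

Not every server answers HEAD requests sensibly, so checking the header on the GET response itself is a reasonable alternative.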

You can strip out the HTML tags to get the plain text, which you can run text analysis on (to get tags, etc., the meat of the page). You could even do that on the alt/title attributes of images if you got that advanced.
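One crude way to do the stripping, sketched with regular expressions. `TextExtractor` and `toPlainText` are hypothetical names, and this approach does not decode entities such as `&amp;` or cope with badly broken markup, so treat it as a starting point rather than a proper HTML-to-text converter.

```java
import java.util.regex.Pattern;

// Hypothetical helper; regex-based tag stripping is a simplification.
class TextExtractor {
    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    /** Returns the visible text of an HTML document with whitespace collapsed. */
    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```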

And in the background you can have a pool of threads eating URLs from the Queue and doing the same. You want to limit the number of threads of course.
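A sketch of the thread-pool variant, assuming a fixed-size `ExecutorService`. The names `PooledCrawler`, `crawl`, and `linksOf` are hypothetical; `linksOf` stands in for "fetch this URL and return the links found in it". The pending counter is one possible way to detect that the crawl has finished; it is not the only one.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical class; a sketch of a bounded pool of crawl workers.
class PooledCrawler {
    /** Crawls from root with at most `threads` concurrent workers; returns every URL seen. */
    static Set<String> crawl(String root, Function<String, List<String>> linksOf, int threads) {
        Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe record of known URLs
        ExecutorService pool = Executors.newFixedThreadPool(threads); // limits the thread count
        AtomicInteger pending = new AtomicInteger();      // tasks submitted but not yet finished
        CountDownLatch done = new CountDownLatch(1);

        submit(root, linksOf, seen, pool, pending, done);
        try {
            done.await();       // released when the last pending task finishes
            pool.shutdown();
        } catch (InterruptedException e) {
            pool.shutdownNow(); // give up on the remaining work
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    private static void submit(String url, Function<String, List<String>> linksOf,
                               Set<String> seen, ExecutorService pool,
                               AtomicInteger pending, CountDownLatch done) {
        if (!seen.add(url)) {
            return; // already queued or processed by some worker
        }
        pending.incrementAndGet(); // counted before the parent task finishes
        pool.execute(() -> {
            try {
                for (String link : linksOf.apply(url)) {
                    submit(link, linksOf, seen, pool, pending, done);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown(); // nothing queued and nothing running
                }
            }
        });
    }
}
```

Because each task registers its child links before it decrements the counter, the counter can only reach zero when no work is queued or running.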

一生所求

2020-12-02 04:31

Multithreaded Web Crawler

If you want to crawl a large website, you should write a multi-threaded crawler. Connecting, fetching, and writing the crawled information to files or a database are the three steps of crawling, and if you use a single thread for all of them your CPU and network utilization will be poor.

A multi-threaded web crawler needs two data structures: linksVisited (this should be implemented as a hash map or trie) and linksToBeVisited (this is a queue).

The web crawler uses BFS to traverse the World Wide Web.

Algorithm of a basic web crawler:
1. Add one or more seed URLs to linksToBeVisited. The method that adds a URL to linksToBeVisited must be synchronized.
2. Pop an element from linksToBeVisited and add it to linksVisited. The method that pops a URL from linksToBeVisited must be synchronized.
3. Fetch the page from the internet.
4. Parse the file and add any not-yet-visited link found in the page to linksToBeVisited. URLs can be filtered if needed; the user can supply a set of rules for which URLs to scan.
5. Save the necessary information found on the page to a database or file.
6. Repeat steps 2 to 5 until the linksToBeVisited queue is empty.
  
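Here is a code snippet showing how to synchronize the threads:
```
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// The class name and field declarations are assumed; the original snippet showed only the two methods.
public class LinkFrontier {
    private final List<String> linksToBeVisited = new ArrayList<>(); // FIFO queue of pending URLs
    private final Set<String> linksVisited = new HashSet<>();        // URLs already handed out

    // Queue a URL unless it has already been visited or is already waiting in the queue.
    public synchronized void add(String site) {
        if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
            linksToBeVisited.add(site);
        }
    }

    // Hand out the next URL and mark it visited; returns null when the queue is empty.
    // The whole method is synchronized so the size check and the removal happen atomically.
    public synchronized String next() {
        if (linksToBeVisited.isEmpty()) {
            return null;
        }
        String s = linksToBeVisited.remove(0);
        linksVisited.add(s);
        return s;
    }
}
```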
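One thing the steps leave open is how a worker knows when to stop: an empty queue may only mean that other threads are still fetching pages that will produce more links. The sketch below is one way to handle that with an in-flight counter and `wait`/`notifyAll`. `BlockingFrontier`, `done`, `seen`, `crawl`, and `linksOf` are hypothetical names, not part of the snippet above, and `linksOf` stands in for steps 3 to 5 (fetch, parse, save).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical class; a sketch of worker termination, not a definitive design.
class BlockingFrontier {
    private final Queue<String> linksToBeVisited = new ArrayDeque<>();
    private final Set<String> linksSeen = new HashSet<>(); // queued or already handed out
    private int inFlight = 0; // URLs handed out by next() and not yet reported via done()

    synchronized void add(String site) {
        if (linksSeen.add(site)) {
            linksToBeVisited.add(site);
            notifyAll(); // wake workers waiting for work
        }
    }

    /** Blocks until a URL is available; returns null only when the crawl is finished. */
    synchronized String next() throws InterruptedException {
        while (linksToBeVisited.isEmpty() && inFlight > 0) {
            wait(); // queue is empty, but a busy worker may still add links
        }
        String site = linksToBeVisited.poll();
        if (site != null) {
            inFlight++;
        }
        return site;
    }

    synchronized void done() {
        inFlight--;
        notifyAll(); // lets idle workers notice that the crawl may be over
    }

    synchronized Set<String> seen() {
        return new HashSet<>(linksSeen);
    }

    /** Runs `threads` workers from one seed URL and returns every URL that was seen. */
    static Set<String> crawl(String seed, Function<String, List<String>> linksOf, int threads) {
        BlockingFrontier frontier = new BlockingFrontier();
        frontier.add(seed);
        Runnable worker = () -> {
            try {
                String url;
                while ((url = frontier.next()) != null) {
                    try {
                        for (String link : linksOf.apply(url)) {
                            frontier.add(link);
                        }
                    } finally {
                        frontier.done(); // always report completion, even on failure
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(worker);
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return frontier.seen();
    }
}
```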
再見小時候

2020-12-02 04:31

Use wget to do a recursive web suck, which will dump all the files onto your hard drive, then write another script to go through the downloaded files and analyze them.

Edit: or maybe curl instead of wget, but I am not familiar with curl and do not know whether it does recursive downloads like wget.
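For the second half (going through the downloaded files), here is a minimal sketch in Java 11+, assuming the site was already mirrored into a local directory, for example with `wget --recursive --no-parent`. `MirrorReport` and `titles` are hypothetical names, the files are assumed to be UTF-8, and listing page titles is just one example of an analysis.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical class; a sketch of post-processing a downloaded mirror.
class MirrorReport {
    private static final Pattern TITLE =
            Pattern.compile("(?is)<title[^>]*>(.*?)</title>");

    /** Maps each .html/.htm file under root (path relative to root) to its title text. */
    static Map<String, String> titles(Path root) {
        Map<String, String> report = new TreeMap<>(); // sorted by path for stable output
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().matches("(?i).*\\.html?"))
                 .forEach(p -> report.put(root.relativize(p).toString(), titleOf(p)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }

    private static String titleOf(Path file) {
        try {
            String html = Files.readString(file); // assumes UTF-8 content
            Matcher m = TITLE.matcher(html);
            return m.find() ? m.group(1).trim() : "(no title)";
        } catch (IOException e) {
            return "(unreadable)"; // e.g. the file is not valid UTF-8
        }
    }
}
```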
