Is it possible to write a web crawler in JavaScript?

Backend · Unresolved · 11 answers · 570 views

Asked by 深忆病人 on 2021-02-01 07:48

I want to crawl a page, check for the hyperlinks on that page, follow those hyperlinks, and capture data from each page.

11 Answers
  • 2021-02-01 08:24

This is what you need: http://zugravu.com/products/web-crawler-spider-scraping-javascript-regular-expression-nodejs-mongodb They use Node.js, MongoDB, and ExtJS as the GUI.

  • 2021-02-01 08:25

    Yes, it is possible:

    1. Use Node.js (it's server-side JavaScript).
    2. Node.js has npm, a package manager that handles third-party modules.
    3. Use PhantomJS with Node.js (PhantomJS is a third-party module that can crawl websites).
  • 2021-02-01 08:26

    If you use server-side JavaScript, it is possible. You should take a look at Node.js.

    An example of a crawler can be found in the link below:

    http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
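    In the same spirit as the linked post, here is a hedged sketch of the core traversal: a breadth-first walk over a queue of URLs with a visited set. The fetch step is injected as a function so the logic stands alone (`fetchFn`, `maxPages`, and the naive href regex are my own simplifications, not taken from the linked article):

```javascript
// BFS crawl sketch: walk pages starting from startUrl, following href links.
// fetchFn(url) must return the page's HTML as a string (injected so the
// traversal logic is independent of any particular HTTP client).
function crawl(startUrl, fetchFn, maxPages) {
  const visited = new Set();
  const queue = [startUrl];
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    const html = fetchFn(url);
    // Naive link extraction; a real crawler would resolve relative URLs too.
    const re = /href="([^"]+)"/g;
    let m;
    while ((m = re.exec(html)) !== null) {
      if (!visited.has(m[1])) queue.push(m[1]);
    }
  }
  return Array.from(visited);
}
```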

  • 2021-02-01 08:27

    You can make a web crawler driven from a remote JSON file that opens all links from a page in new tabs as soon as each tab loads, skipping links that have already been opened. If you set this up as a browser extension running on a stripped-down machine (nothing running except the web browser and an internet config program) and shipped it somewhere with a good connection, you could build a database of webpages with an old computer; the extension would just need to retrieve the content of each tab. You could do that for about $2,000, contrary to most estimates of search-engine costs. Your ranking algorithm would basically serve pages based on how often a term appears in the page's innerText property, its keywords, and its description. You could also set up another PC to recrawl pages from the one-time database and add more. I'd estimate the whole thing would take about three months and $20,000, maximum.
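    The ranking idea described above can be sketched in a few lines. Note that the weights below are arbitrary assumptions, and the page object with `innerText`/`keywords`/`description` fields is hypothetical:

```javascript
// Sketch of the term-frequency ranking described above: score a page by how
// often a query term appears in its text, keywords, and description.
function scorePage(page, term) {
  // Count case-insensitive occurrences of the term in a string.
  // Assumes the term contains no regex metacharacters.
  const count = (s) =>
    (s.toLowerCase().match(new RegExp(term.toLowerCase(), 'g')) || []).length;
  // Weighting keywords and description over body text is an assumption.
  return count(page.innerText) + 2 * count(page.keywords) + 2 * count(page.description);
}
```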

  • 2021-02-01 08:29

    I made an example JavaScript crawler on GitHub.

    It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).

    How to use it in your Node environment:

    var Crawler = require('../lib/crawler')
    var crawler = new Crawler('http://www.someUrl.com');
    
    // crawler.maxDepth = 4;
    // crawler.crawlInterval = 10;
    // crawler.maxListenerCurrency = 10;
    // crawler.redisQueue = true;
    crawler.start();
    

    Here I'm just showing you two core methods of a JavaScript crawler.

    Crawler.prototype.run = function() {
      var crawler = this;
      process.nextTick(() => {
        //the run loop
        crawler.crawlerIntervalId = setInterval(() => {
    
          crawler.crawl();
    
        }, crawler.crawlInterval);
        //kick off first one
        crawler.crawl();
      });
    
      crawler.running = true;
      crawler.emit('start');
    }
    
    
    Crawler.prototype.crawl = function() {
      var crawler = this;
    
      if (crawler._openRequests >= crawler.maxListenerCurrency) return;
    
    
      //go get the item
      crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
        if (queueItem) {
          //got an item, start the fetch
          crawler.fetchQueueItem(queueItem, index);
        } else if (crawler._openRequests === 0) {
          crawler.queue.complete((err, completeCount) => {
            if (err)
              throw err;
            crawler.queue.getLength((err, length) => {
              if (err)
                throw err;
              if (length === completeCount) {
                //no open requests, no unfetched items: stop the crawler
                crawler.emit("complete", completeCount);
                clearInterval(crawler.crawlerIntervalId);
                crawler.running = false;
              }
            });
          });
        }
    
      });
    };
    

    Here is the GitHub link: https://github.com/bfwg/node-tinycrawler. It is a JavaScript web crawler written in under 1,000 lines of code. This should put you on the right track.

  • 2021-02-01 08:32

    There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.
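    To make the console approach concrete, here is a small sketch. `sameOriginLinks` is a plain, testable helper; the commented lines show how you would feed it from a browser console (`document.links` and `location.origin` exist in any browser; nothing here is specific to any particular site):

```javascript
// Keep only the URLs that share an origin with the current page; these are
// the ones a console script can request without hitting the same-origin policy.
function sameOriginLinks(urls, origin) {
  return urls.filter((u) => {
    try {
      // Resolve relative URLs against the origin, then compare origins.
      return new URL(u, origin).origin === origin;
    } catch (e) {
      return false; // malformed URL
    }
  });
}

// In a browser console you would gather the candidates like this:
// const urls = Array.from(document.links, (a) => a.href);
// console.log(sameOriginLinks(urls, location.origin));
```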

    Another way is to provide a PHP proxy. Your script accesses the proxy on the current domain and requests files from another domain through PHP. Just be careful with those: they might get hijacked and used as a public proxy by a third party if you are not careful.

    Good luck, and maybe you'll make a friend or two in the process like I did :-)
