web-crawler

Web Crawling (Ajax/JavaScript enabled pages) using Java

纵然是瞬间 submitted on 2020-01-09 09:37:08
Question: I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to extract the following information from that page (please take a look at the attached screenshot). If you look at the screenshot, it shows three names, highlighted in red boxes.
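
No answer is included in this excerpt. For orientation, a bare-bones crawler4j setup looks roughly like the sketch below (class and method names follow crawler4j 4.x and may differ in other versions; the storage folder is a placeholder). One point worth knowing: crawler4j fetches raw HTML and does not execute JavaScript, so content injected client-side will not show up in the parsed text.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class ArticleCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay on the target site.
        return url.getURL().startsWith("http://www.sciencedirect.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Only server-rendered content appears here; JavaScript is not executed.
            System.out.println(html.getText());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // placeholder path
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        controller.start(ArticleCrawler.class, 1);
    }
}
```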

How to write a crawler?

拜拜、爱过 submitted on 2020-01-09 04:00:38
Question: I have been thinking about writing a simple crawler that could crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it has found, etc.? Answer 1: You'll be reinventing the wheel, to be sure. But here are the basics: a list of unvisited URLs - seed this with one or more starting pages; a list of
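
The answer is cut off mid-enumeration, but the data structures it starts to list (a frontier of unvisited URLs seeded with starting pages, plus a record of what has already been visited) are the heart of any crawler. Below is a minimal Java sketch of that loop. It uses jsoup for fetching and link extraction, which is my own choice for illustration since the answer names no library; the seed URL and page cap are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> unvisited = new ArrayDeque<>(); // frontier: URLs still to fetch
        Set<String> visited = new HashSet<>();        // URLs already fetched
        unvisited.add("https://example.org/");        // seed with one or more starting pages

        while (!unvisited.isEmpty() && visited.size() < 100) {   // hard cap for the sketch
            String url = unvisited.poll();
            if (!visited.add(url)) {
                continue;                             // skip anything seen before
            }
            try {
                Document doc = Jsoup.connect(url).userAgent("SimpleCrawler/0.1").get();
                System.out.println(url + " -> " + doc.title());  // "send back" a finding

                // Queue every absolute link on the page and keep crawling.
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        unvisited.add(next);
                    }
                }
            } catch (Exception e) {
                // Non-HTML responses, timeouts, etc. -- just move on.
            }
        }
    }
}
```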

How to bypass robots.txt with Apache Nutch 2.2.1

北城余情 submitted on 2020-01-07 06:46:10
Question: Can anyone please tell me whether there is any way for Apache Nutch to ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that RobotRulesParser.java (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file so that robots.txt is ignored and crawling continues, or is there another way to achieve the same thing? Answer 1: First of all, we should respect the
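
The answer is truncated just as it begins to caution that robots.txt should normally be respected. For completeness, the approach people usually describe for this is to short-circuit the robots handling so it always returns an allow-all rule set. The snippet below only demonstrates what such a rule set looks like using crawler-commons (the library recent Nutch versions delegate robots.txt parsing to); whether and where it plugs into Nutch 2.2.1's RobotRulesParser is an assumption you would need to verify against that version's source.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class AllowAllRobotRules {

    // An allow-all rule set: returning something like this from the parser the
    // question mentions would effectively disable robots.txt checks.
    public static BaseRobotRules allowAll() {
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }

    public static void main(String[] args) {
        BaseRobotRules rules = allowAll();
        // Reports every URL as allowed, regardless of what robots.txt says.
        System.out.println(rules.isAllowed("http://example.org/private/page.html"));
    }
}
```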

Can I extract comments from any page on https://www.rt.com/ using Python 3?

不打扰是莪最后的温柔 submitted on 2020-01-07 06:25:27
Question: I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F, searching for the comment text). I think the comments are rendered with JavaScript. Can I extract them? Answer 1: RT uses a service from spot.im for comments. You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
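
The question asks about Python, but to stay in one language with the other examples on this page, here is a sketch of the same two-request flow using java.net.http (Java 11+). The two endpoints come from the answer; everything else - whether the requests need bodies and which header carries the token - is an assumption, since those details are not given.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SpotImComments {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1) POST to obtain a network token. The exact headers/body spot.im
        //    expects are not shown in the answer, so none are sent here.
        HttpRequest tokenReq = HttpRequest.newBuilder(
                URI.create("https://api.spot.im/me/network-token/spotim"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        String tokenJson = client.send(tokenReq, HttpResponse.BodyHandlers.ofString()).body();
        // ...parse the token out of tokenJson with the JSON library of your choice...

        // 2) POST to the conversation-read endpoint, passing the token
        //    (the header name below is hypothetical) to get the comments as JSON.
        HttpRequest commentsReq = HttpRequest.newBuilder(
                URI.create("https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get"))
                .header("x-spotim-token", "TOKEN_FROM_STEP_1")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        String commentsJson = client.send(commentsReq, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(commentsJson);
    }
}
```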

Crawl and scrape a complete site with Scrapy

南笙酒味 submitted on 2020-01-07 04:24:17
Question:

    import scrapy
    from scrapy import Request
    # scrapy crawl jobs9 -o jobs9.csv -t csv

    class JobsSpider(scrapy.Spider):
        name = "jobs9"
        allowed_domains = ["vapedonia.com"]
        start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-",
                      "https://www.vapedonia.com/10-cigarrillos-electronicos-",
                      "https://www.vapedonia.com/11-mods-potencia-",
                      "https://www.vapedonia.com/12-consumibles",
                      "https://www.vapedonia.com/13-baterias",
                      "https://www.vapedonia.com/23-e-liquidos",
                      "https://www.vapedonia

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

守給你的承諾、 submitted on 2020-01-06 23:43:25
Question: I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3, and the setup went well. But when I executed the following command, no URLs were fetched:

    ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

Below are my settings:

    nutch-site.xml: http://ideone.com/H8MPcl
    regex-urlfilter.txt: +^http://([a-z0-9]*\.)*nutch.apache.org/
    hadoop.log: http://ideone.com/LnpAw4

I don't see any errors in the log file. I am really lost. Any help would be appreciated.

Best method of moving to a new page with request-promise?

房东的猫 submitted on 2020-01-06 14:57:12
Question: I am tinkering with request-promise to crawl a friend's webpage, using the "crawl a webpage better" example from their GitHub. What I have so far is this:

    var rp = require('request-promise');
    var cheerio = require('cheerio'); // Basically jQuery for node.js

    var options = {
        uri: 'https://friendspage.org',
        transform: function(body) {
            return cheerio.load(body);
        }
    };

    rp(options)
        .then(function($) {
            // Process html like you would with jQuery...
            var nxtPage = $("a[data-url$='nxtPageId']")

Explicit special characters from crawling

。_饼干妹妹 submitted on 2020-01-06 12:25:33
Question: I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I restrict the crawler so that it does not crawl/index special characters such as � � � � � ��� �� � •? Answer 1: An easy way to do this is to write a ParseFilter, along the lines of:

    ParseData pd = parse.get(URL);
    String text = pd.getText();
    // remove chars
    pd.setText(text);

This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples. Source: https://stackoverflow.com/questions/54096045/explicit-special
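
Fleshing the answer's fragment out into a complete filter class, as a hedged sketch: the ParseFilter signature below follows StormCrawler 1.x as far as I recall it, the character-stripping logic is my own illustration, and the class would still have to be registered in the topology's parsefilters.json - verify both points against the parse filter examples in the StormCrawler repository.

```java
import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.parse.ParseData;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

public class SpecialCharParseFilter extends ParseFilter {

    @Override
    public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse) {
        ParseData pd = parse.get(URL);
        String text = pd.getText();
        if (text == null) {
            return;
        }
        // Drop the Unicode replacement character and collapse control characters
        // so they never reach the Elasticsearch index.
        String cleaned = text.replace("\uFFFD", "").replaceAll("\\p{Cntrl}+", " ").trim();
        pd.setText(cleaned);
    }

    @Override
    public boolean needsDOM() {
        // Only the extracted text is rewritten, so no DOM access is needed.
        return false;
    }
}
```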

Jsoup returns Status 400

懵懂的女人 submitted on 2020-01-06 11:25:11
Question: I want to crawl data from this URL: http://www.expedia.co.jp/Osaka-Hotels-Hotel-Consort.h5522663.Hotel-Information?chkin=2017/12/13&chkout=2017/12/14&rm1=a2 So I have written the following code:

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import static org.jsoup.Connection.*;

    /**
     * Created by avi on 11/24/17.
     */
    public class ExpediaCurl {
        public static void main(String[] args) {
            final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit
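
There is no answer in this excerpt, and the code is cut off before the actual connection call. As a hedged sketch of where I would look first: an HTTP 400 means the server rejected the request itself, and with Jsoup the usual suspects are unencoded query parameter values and missing browser-like headers (an assumption, not a confirmed diagnosis for this site). A complete fetch might look like this:

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExpediaFetch {
    public static void main(String[] args) throws IOException {
        final String USER_AGENT = "Mozilla/5.0"; // any realistic browser User-Agent string

        // Percent-encode the date values so the query string is unambiguous.
        String url = "http://www.expedia.co.jp/Osaka-Hotels-Hotel-Consort.h5522663.Hotel-Information"
                + "?chkin=" + URLEncoder.encode("2017/12/13", StandardCharsets.UTF_8.name())
                + "&chkout=" + URLEncoder.encode("2017/12/14", StandardCharsets.UTF_8.name())
                + "&rm1=a2";

        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .header("Accept", "text/html")
                .timeout(15_000)
                .get();

        System.out.println(doc.title());
    }
}
```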

Is there any JavaScript (and client-side) wget implementation?

感情迁移 submitted on 2020-01-06 09:03:49
Question: In order to provide a service for webmasters, I need to download the public part of their sites. I'm currently doing it with wget on my server, but that introduces a lot of load, and I'd like to move this part to the client side. Does an implementation of wget exist in JavaScript? If it does, I could zip the files and send them to my server for processing, which would let me concentrate on the core business of my app. I know some compression libraries exist in JS (such as zip.js), but I