web-crawler

Web Crawling (Ajax/JavaScript enabled pages) using Java

纵然是瞬间 submitted on 2020-01-09 09:37:08
Question: I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to extract the following information from that page (please take a look at the attached screenshot). If you look at the screenshot, it shows three names, highlighted in red boxes.
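
No answer is included in this excerpt. For orientation, a bare-bones crawler4j setup looks roughly like the sketch below (class and method names follow crawler4j 4.x and may differ in other versions; the storage folder is a placeholder). One point worth knowing: crawler4j fetches raw HTML and does not execute JavaScript, so content injected client-side will not show up in the parsed text.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class ArticleCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay on the target site.
        return url.getURL().startsWith("http://www.sciencedirect.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Only server-rendered content appears here; JavaScript is not executed.
            System.out.println(html.getText());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // placeholder path
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        controller.start(ArticleCrawler.class, 1);
    }
}
```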

How to write a crawler?

拜拜、爱过 submitted on 2020-01-09 04:00:38
Question: I have been thinking about writing a simple crawler that could crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it has found, etc.? Answer 1: You'll be reinventing the wheel, to be sure. But here are the basics: a list of unvisited URLs - seed this with one or more starting pages; a list of
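
The answer is cut off mid-enumeration, but the data structures it starts to list (a frontier of unvisited URLs seeded with starting pages, plus a record of what has already been visited) are the heart of any crawler. Below is a minimal Java sketch of that loop. It uses jsoup for fetching and link extraction, which is my own choice for illustration since the answer names no library; the seed URL and page cap are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> unvisited = new ArrayDeque<>(); // frontier: URLs still to fetch
        Set<String> visited = new HashSet<>();        // URLs already fetched
        unvisited.add("https://example.org/");        // seed with one or more starting pages

        while (!unvisited.isEmpty() && visited.size() < 100) {   // hard cap for the sketch
            String url = unvisited.poll();
            if (!visited.add(url)) {
                continue;                             // skip anything seen before
            }
            try {
                Document doc = Jsoup.connect(url).userAgent("SimpleCrawler/0.1").get();
                System.out.println(url + " -> " + doc.title());  // "send back" a finding

                // Queue every absolute link on the page and keep crawling.
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        unvisited.add(next);
                    }
                }
            } catch (Exception e) {
                // Non-HTML responses, timeouts, etc. -- just move on.
            }
        }
    }
}
```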

How to bypass robots.txt with Apache Nutch 2.2.1

北城余情 submitted on 2020-01-07 06:46:10
Question: Can anyone please tell me whether there is any way for Apache Nutch to ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that RobotRulesParser.java (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file so that robots.txt is ignored and crawling continues, or is there another way to achieve the same thing? Answer 1: First of all, we should respect the
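
The answer is truncated just as it begins to caution that robots.txt should normally be respected. For completeness, the approach people usually describe for this is to short-circuit the robots handling so it always returns an allow-all rule set. The snippet below only demonstrates what such a rule set looks like using crawler-commons (the library recent Nutch versions delegate robots.txt parsing to); whether and where it plugs into Nutch 2.2.1's RobotRulesParser is an assumption you would need to verify against that version's source.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class AllowAllRobotRules {

    // An allow-all rule set: returning something like this from the parser the
    // question mentions would effectively disable robots.txt checks.
    public static BaseRobotRules allowAll() {
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }

    public static void main(String[] args) {
        BaseRobotRules rules = allowAll();
        // Reports every URL as allowed, regardless of what robots.txt says.
        System.out.println(rules.isAllowed("http://example.org/private/page.html"));
    }
}
```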

Can I extract comments from any page on https://www.rt.com/ using Python 3?

不打扰是莪最后的温柔 submitted on 2020-01-07 06:25:27
Question: I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F, searching for the comment text). I think the comments are rendered with JavaScript. Can I extract them? Answer 1: RT uses a service from spot.im for comments. You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
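
The question asks about Python, but to stay in one language with the other examples on this page, here is a sketch of the same two-request flow using java.net.http (Java 11+). The two endpoints come from the answer; everything else - whether the requests need bodies and which header carries the token - is an assumption, since those details are not given.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SpotImComments {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1) POST to obtain a network token. The exact headers/body spot.im
        //    expects are not shown in the answer, so none are sent here.
        HttpRequest tokenReq = HttpRequest.newBuilder(
                URI.create("https://api.spot.im/me/network-token/spotim"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        String tokenJson = client.send(tokenReq, HttpResponse.BodyHandlers.ofString()).body();
        // ...parse the token out of tokenJson with the JSON library of your choice...

        // 2) POST to the conversation-read endpoint, passing the token
        //    (the header name below is hypothetical) to get the comments as JSON.
        HttpRequest commentsReq = HttpRequest.newBuilder(
                URI.create("https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get"))
                .header("x-spotim-token", "TOKEN_FROM_STEP_1")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        String commentsJson = client.send(commentsReq, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(commentsJson);
    }
}
```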

Crawl and scrape a complete site with Scrapy

南笙酒味 submitted on 2020-01-07 04:24:17
Question:

    import scrapy
    from scrapy import Request
    # scrapy crawl jobs9 -o jobs9.csv -t csv

    class JobsSpider(scrapy.Spider):
        name = "jobs9"
        allowed_domains = ["vapedonia.com"]
        start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-",
                      "https://www.vapedonia.com/10-cigarrillos-electronicos-",
                      "https://www.vapedonia.com/11-mods-potencia-",
                      "https://www.vapedonia.com/12-consumibles",
                      "https://www.vapedonia.com/13-baterias",
                      "https://www.vapedonia.com/23-e-liquidos",
                      "https://www.vapedonia

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

守給你的承諾、 submitted on 2020-01-06 23:43:25
Question: I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3, and the setup went well. But when I executed the following command, no URLs were fetched:

    ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

Below are my settings:

    nutch-site.xml: http://ideone.com/H8MPcl
    regex-urlfilter.txt: +^http://([a-z0-9]*\.)*nutch.apache.org/
    hadoop.log: http://ideone.com/LnpAw4

I don't see any errors in the log file. I am really lost. Any help would be appreciated.

Best method of moving to a new page with request-promise?

房东的猫 submitted on 2020-01-06 14:57:12
Question: I am tinkering with request-promise to crawl a friend's webpage, using the "crawl a webpage better" example from their GitHub. What I have so far is this:

    var rp = require('request-promise');
    var cheerio = require('cheerio'); // Basically jQuery for node.js

    var options = {
        uri: 'https://friendspage.org',
        transform: function(body) {
            return cheerio.load(body);
        }
    };

    rp(options)
        .then(function($) {
            // Process html like you would with jQuery...
            var nxtPage = $("a[data-url$='nxtPageId']")

Explicit special characters from crawling

。_饼干妹妹 submitted on 2020-01-06 12:25:33
Question: I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I restrict the crawler so that it does not crawl/index special characters such as � � � � � ��� �� � •? Answer 1: An easy way to do this is to write a ParseFilter, along the lines of:

    ParseData pd = parse.get(URL);
    String text = pd.getText();
    // remove chars
    pd.setText(text);

This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples. Source: https://stackoverflow.com/questions/54096045/explicit-special
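
Fleshing the answer's fragment out into a complete filter class, as a hedged sketch: the ParseFilter signature below follows StormCrawler 1.x as far as I recall it, the character-stripping logic is my own illustration, and the class would still have to be registered in the topology's parsefilters.json - verify both points against the parse filter examples in the StormCrawler repository.

```java
import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.parse.ParseData;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

public class SpecialCharParseFilter extends ParseFilter {

    @Override
    public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse) {
        ParseData pd = parse.get(URL);
        String text = pd.getText();
        if (text == null) {
            return;
        }
        // Drop the Unicode replacement character and collapse control characters
        // so they never reach the Elasticsearch index.
        String cleaned = text.replace("\uFFFD", "").replaceAll("\\p{Cntrl}+", " ").trim();
        pd.setText(cleaned);
    }

    @Override
    public boolean needsDOM() {
        // Only the extracted text is rewritten, so no DOM access is needed.
        return false;
    }
}
```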

Jsoup returns Status 400

懵懂的女人 submitted on 2020-01-06 11:25:11
Question: I want to crawl data from this URL: http://www.expedia.co.jp/Osaka-Hotels-Hotel-Consort.h5522663.Hotel-Information?chkin=2017/12/13&chkout=2017/12/14&rm1=a2 So I have written the following code:

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import static org.jsoup.Connection.*;

    /**
     * Created by avi on 11/24/17.
     */
    public class ExpediaCurl {
        public static void main(String[] args) {
            final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit
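
There is no answer in this excerpt, and the code is cut off before the actual connection call. As a hedged sketch of where I would look first: an HTTP 400 means the server rejected the request itself, and with Jsoup the usual suspects are unencoded query parameter values and missing browser-like headers (an assumption, not a confirmed diagnosis for this site). A complete fetch might look like this:

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExpediaFetch {
    public static void main(String[] args) throws IOException {
        final String USER_AGENT = "Mozilla/5.0"; // any realistic browser User-Agent string

        // Percent-encode the date values so the query string is unambiguous.
        String url = "http://www.expedia.co.jp/Osaka-Hotels-Hotel-Consort.h5522663.Hotel-Information"
                + "?chkin=" + URLEncoder.encode("2017/12/13", StandardCharsets.UTF_8.name())
                + "&chkout=" + URLEncoder.encode("2017/12/14", StandardCharsets.UTF_8.name())
                + "&rm1=a2";

        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .header("Accept", "text/html")
                .timeout(15_000)
                .get();

        System.out.println(doc.title());
    }
}
```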

Is there any JavaScript (and client-side) wget implementation?

感情迁移 submitted on 2020-01-06 09:03:49
Question: In order to provide a service for webmasters, I need to download the public part of their sites. I'm currently doing it with wget on my server, but that introduces a lot of load, and I'd like to move this part to the client side. Does an implementation of wget exist in JavaScript? If it does, I could zip the files and send them to my server for processing, which would let me concentrate on the core business of my app. I know some compression libraries exist in JS (such as zip.js), but I