web-crawler

Why is Python print delayed?

巧了我就是萌 submitted on 2020-01-17 02:53:10
Question: I am trying to download a file using requests and print a dot every time 100 KB of the file is retrieved, but all the dots are printed out at the end. See the code:

```python
with open(file_name, 'wb') as file:
    print("begin downloading, please wait...")
    respond_file = requests.get(file_url, stream=True)
    size = len(respond_file.content) // 1000000
    # the next line will not be printed until file is downloaded
    print("the file size is " + str(size) + "MB")
    for chunk in respond_file.iter_content(102400):
        file.write(chunk)
```
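
The culprit is `respond_file.content`: accessing it forces the entire body to download before the loop ever runs, and any dots printed in the loop (absent from the truncated snippet) would additionally sit in the stdout buffer without an explicit flush. A minimal sketch of a fix, assuming `file_url` and `file_name` are defined as in the question:

```python
import requests

respond_file = requests.get(file_url, stream=True)
# Read the size from the header instead of .content, which would
# download the whole file up front and defeat stream=True.
size = int(respond_file.headers.get("Content-Length", 0)) // 1000000
print("begin downloading, please wait...")
print("the file size is " + str(size) + "MB")

with open(file_name, 'wb') as file:
    for chunk in respond_file.iter_content(102400):
        file.write(chunk)
        # flush=True pushes each dot out immediately instead of
        # leaving it in the stdout buffer until the program ends.
        print(".", end="", flush=True)
```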

WebRequest.Create - The operation has timed out

岁酱吖の submitted on 2020-01-16 18:16:29
Question: I'm trying to crawl a couple of pages on my own site, but I'm getting a time-out WebException ("The operation has timed out") on my live environment but not on my test environment. The time-out does not occur on the same page twice, but randomly, and often after some requests. After the first time-out, the frequency of the time-outs rises. The requestUristring on the test environment: http://localhost/Opgaver/Flytning/Haarde-hvidevarer/Bortkoersel-amerikaner-koeleskab-paa.aspx The requestUristring
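
The question is cut off, but this pattern (random time-outs that become more frequent the more requests are made) is the classic signature of HttpWebResponse objects never being disposed: .NET allows only two concurrent connections per host by default, so leaked responses keep them occupied until later requests time out waiting for a free one. A sketch of the usual fix, under that assumption:

```csharp
using System;
using System.IO;
using System.Net;

class Crawler
{
    static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000; // milliseconds

        // Disposing the response returns the connection to the pool.
        // Without the using blocks, the default per-host limit of two
        // connections fills up and subsequent requests time out.
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Raising `ServicePointManager.DefaultConnectionLimit` also masks the symptom, but disposing the responses addresses the leak itself.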

Scrapy: a weird bug where the spider can't call the pipeline

主宰稳场 submitted on 2020-01-16 08:51:10
Question: I wrote a small spider, and when I run it, it can't call the pipeline. After debugging for a while, I found the buggy code area. The logic of the spider is that I crawl the first URL to fetch a cookie, then crawl the second URL to download the code picture with that cookie, and post some data I prepared to the third URL. If the text I get from the picture is wrong, I download it again and post to the third URL repeatedly, until I get the right text. Let me show you the code:

```python
# -*- coding: gbk -*-
import scrapy
```
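
The question's code is truncated, but in exactly this retry-until-the-CAPTCHA-is-right pattern a frequent cause of a silent pipeline is Scrapy's duplicate filter dropping the repeated request for the code picture, so no callback ever yields an item. A hypothetical sketch (all URLs, field names, and the OCR helper are made up for illustration):

```python
import scrapy


def solve_captcha(image_bytes):
    # Hypothetical stand-in for whatever reads the code picture.
    return "1234"


class CaptchaSpider(scrapy.Spider):
    name = "captcha_retry"
    start_urls = ["http://example.com/login"]  # first URL: fetches the cookie

    def parse(self, response):
        # Second URL: the code picture; session cookies ride along automatically.
        yield scrapy.Request(
            "http://example.com/captcha.jpg",
            callback=self.parse_captcha,
            dont_filter=True,  # repeat requests would otherwise be deduplicated and dropped
        )

    def parse_captcha(self, response):
        text = solve_captcha(response.body)
        yield scrapy.FormRequest(
            "http://example.com/submit",  # third URL
            formdata={"captcha": text},
            callback=self.after_submit,
        )

    def after_submit(self, response):
        if "wrong captcha" in response.text:
            # Retry the picture; dont_filter=True keeps the dupefilter from eating it.
            yield scrapy.Request(
                "http://example.com/captcha.jpg",
                callback=self.parse_captcha,
                dont_filter=True,
            )
        else:
            # Items reach the pipeline only if a callback actually yields them.
            yield {"result": "ok"}
```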

Download all PDF files from crawled links

倾然丶 夕夏残阳落幕 submitted on 2020-01-16 08:27:33
Question: While running the code it says that ProductListPage is null, and after raising the error it does not proceed any further. Any ideas how to solve this issue? Wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or something else? Here is my current code:

```csharp
HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes(
    "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
```
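
HtmlAgilityPack's `SelectNodes` returns null rather than an empty collection when the XPath matches nothing, so the result has to be null-checked; and if it is always null, the product list is most likely rendered by JavaScript and never present in the raw HTML that HtmlWeb downloads. A minimal sketch under the assumption that the links are in the static HTML and are absolute URLs:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PdfCrawler
{
    static void Main()
    {
        var htmlDoc = new HtmlWeb().Load("https://example.com/");
        var productLinks = htmlDoc.DocumentNode.SelectNodes(
            "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");

        // SelectNodes yields null, not an empty list, when nothing matches.
        if (productLinks == null)
        {
            Console.WriteLine("No product links found; the page may build them with JavaScript.");
            return;
        }

        using (var client = new WebClient())
        {
            foreach (var link in productLinks)
            {
                string href = link.GetAttributeValue("href", "");
                if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // File name taken from the URL; adjust as needed.
                    client.DownloadFile(href, Path.GetFileName(href));
                }
            }
        }
    }
}
```

If the nodes are generated client-side, no amount of waiting helps with HtmlAgilityPack alone; a browser-driven tool such as Selenium is the usual escape hatch.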

Identifying a Search Engine Crawler

和自甴很熟 submitted on 2020-01-16 04:48:07
Question: I am working on a website which loads its data via AJAX. I also want the whole website to be crawlable by search engines like Google and Yahoo. I want to make two versions of the site: [1] when a user comes, the hyperlinks should work just like Gmail's (#'ed hyperlinks); [2] when a crawler comes, the hyperlinks should work normally (AJAX mode off). How can I identify a crawler? Answer 1: You should not present a different form of your website to your users and a crawler. If Google discovers you
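
For the mechanical part of the question (the answer's warning about cloaking still stands), crawlers conventionally identify themselves in the User-Agent request header. A minimal sketch in Python; the bot list is illustrative, not exhaustive, and the header is trivially spoofable:

```python
# Illustrative substrings from well-known crawler User-Agent strings.
KNOWN_BOTS = ("googlebot", "yahoo! slurp", "bingbot", "baiduspider")


def is_crawler(user_agent):
    """Heuristically decide whether a request came from a search-engine bot."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_BOTS)


print(is_crawler(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # True
```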

jsoup to log in to a website

二次信任 submitted on 2020-01-16 01:19:12
Question: I am trying to use jsoup to get information after logging into "http://pawscas.usask.ca/cas-web/login". I've tried what's below and it doesn't seem to work; any help would be appreciated, thanks.

```java
Connection.Response res = null;
try {
    res = Jsoup.connect("http://pawscas.usask.ca/cas-web/login")
            .data("username", "user")
            .data("password", "pass")
            //.data("It", "some data")
            //.data("execution", "some data")
            //.data("_eventId", "submit")
            .method(Method.POST)
            .execute();
} catch (IOException e) {
```
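
The commented-out fields point at the usual culprit: a CAS login form carries hidden one-time tokens, "lt" (lowercase L, likely what the "It" line was meant to be), "execution", and "_eventId", which must be scraped from a fresh GET of the login page and posted back along with that request's cookies. A sketch under that assumption (the input names follow standard CAS deployments and are not verified against this site):

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CasLogin {
    public static void main(String[] args) throws IOException {
        String loginUrl = "http://pawscas.usask.ca/cas-web/login";

        // Step 1: GET the login page for its session cookies and hidden tokens.
        Connection.Response loginPage = Jsoup.connect(loginUrl)
                .method(Connection.Method.GET)
                .execute();
        Document doc = loginPage.parse();
        String lt = doc.select("input[name=lt]").val();
        String execution = doc.select("input[name=execution]").val();

        // Step 2: POST credentials plus the tokens, reusing the same cookies.
        Connection.Response res = Jsoup.connect(loginUrl)
                .cookies(loginPage.cookies())
                .data("username", "user")
                .data("password", "pass")
                .data("lt", lt)
                .data("execution", execution)
                .data("_eventId", "submit")
                .method(Connection.Method.POST)
                .execute();

        System.out.println("Status after login: " + res.statusCode());
    }
}
```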

Date format getting disturbed when creating a .CSV file in Java

偶尔善良 submitted on 2020-01-16 01:10:30
Question: I am creating a web scraper and then storing the data in a .CSV file. My program runs fine, but there is a problem: the website I am retrieving the data from has a date in (Month Day, Year) format, so when I save the data in the .CSV file it treats the Year as another column, because of which all the data gets shifted. I actually want to store that date as (MM-MON-YYYY) and keep the validity date in one column. I am posting my code below. Kindly help me out. Thanks! P
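
The root cause is the comma inside "Month Day, Year": an unquoted comma starts a new CSV column. One fix is to reformat the date into a comma-free form before writing it; a minimal sketch, with the sample input string assumed from the question's description:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateReformat {
    public static void main(String[] args) {
        String scraped = "January 16, 2020"; // assumed shape of the scraped date

        DateTimeFormatter in = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);
        DateTimeFormatter out = DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

        LocalDate date = LocalDate.parse(scraped, in);
        // 16-Jan-2020: no comma, so the whole date stays in one CSV column.
        System.out.println(date.format(out));
    }
}
```

Alternatively, wrapping the field in double quotes ("January 16, 2020") keeps the original text in a single column, since commas inside quoted CSV fields are not treated as separators.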

Selenium with PhantomJS: Form being validated but not submitted

别说谁变了你拦得住时间么 submitted on 2020-01-15 18:50:15
Question: I'm having a strange problem submitting a form through Selenium WebDriver's PhantomJS API. Upon clicking the submit button, the form gets validated (are the username and password too short, or blank, etc.), but it does not get ultimately submitted. That is, if I submit an invalid form and check the screenshot, there are alert notifications. If I submit a valid form, nothing happens. The JS on the page is supposed to validate the form, then submit it, when the submit button is clicked. A
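
The question is cut off, but a common workaround when PhantomJS runs the validation handler yet the submission itself never fires is to submit the form element directly after the click. A sketch in Python, with the URL and element locators as placeholders (and noting that PhantomJS is long deprecated; headless Chrome or Firefox reproduces real-browser behavior far more faithfully):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.PhantomJS()  # Selenium 3-era API; removed in Selenium 4
driver.get("http://example.com/login")  # placeholder URL

driver.find_element(By.NAME, "username").send_keys("user")  # placeholder locators
driver.find_element(By.NAME, "password").send_keys("pass")

# Click first so the page's validation JS runs, then fall back to
# submitting the form element directly in case the handler validated
# the fields but never performed the actual submission.
driver.find_element(By.ID, "submit-button").click()
driver.find_element(By.TAG_NAME, "form").submit()

driver.save_screenshot("after_submit.png")  # screenshots are how the question debugs
driver.quit()
```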