web-crawler

Why is Python print delayed?

巧了我就是萌 submitted on 2020-01-17 02:53:10
Question: I am trying to download a file using requests and print a dot every time 100 KB of the file is retrieved, but all the dots are printed out at the end. See the code:

```python
with open(file_name, 'wb') as file:
    print("begin downloading, please wait...")
    respond_file = requests.get(file_url, stream=True)
    size = len(respond_file.content) // 1000000
    # the next line will not be printed until file is downloaded
    print("the file size is " + str(size) + "MB")
    for chunk in respond_file.iter_content(102400):
        file.write(chunk)
```
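
The culprit is `respond_file.content`: accessing it forces the entire body to download before the loop ever runs, and any dots printed in the loop (absent from the truncated snippet) would additionally sit in the stdout buffer without an explicit flush. A minimal sketch of a fix, assuming `file_url` and `file_name` are defined as in the question:

```python
import requests

respond_file = requests.get(file_url, stream=True)
# Read the size from the header instead of .content, which would
# download the whole file up front and defeat stream=True.
size = int(respond_file.headers.get("Content-Length", 0)) // 1000000
print("begin downloading, please wait...")
print("the file size is " + str(size) + "MB")

with open(file_name, 'wb') as file:
    for chunk in respond_file.iter_content(102400):
        file.write(chunk)
        # flush=True pushes each dot out immediately instead of
        # leaving it in the stdout buffer until the program ends.
        print(".", end="", flush=True)
```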

WebRequest.Create - The operation has timed out

岁酱吖の submitted on 2020-01-16 18:16:29
Question: I'm trying to crawl a couple of pages on my own site, but I'm getting a time-out WebException ("The operation has timed out") on my live environment but not on my test environment. The time-out does not occur on the same page twice, but randomly, and often after some requests. After the first time-out, the frequency of the time-outs rises. The requestUristring on the test environment: http://localhost/Opgaver/Flytning/Haarde-hvidevarer/Bortkoersel-amerikaner-koeleskab-paa.aspx The requestUristring
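
The question is cut off, but this pattern (random time-outs that become more frequent the more requests are made) is the classic signature of HttpWebResponse objects never being disposed: .NET allows only two concurrent connections per host by default, so leaked responses keep them occupied until later requests time out waiting for a free one. A sketch of the usual fix, under that assumption:

```csharp
using System;
using System.IO;
using System.Net;

class Crawler
{
    static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000; // milliseconds

        // Disposing the response returns the connection to the pool.
        // Without the using blocks, the default per-host limit of two
        // connections fills up and subsequent requests time out.
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Raising `ServicePointManager.DefaultConnectionLimit` also masks the symptom, but disposing the responses addresses the leak itself.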

Scrapy: a weird bug where the spider can't call the pipeline

主宰稳场 submitted on 2020-01-16 08:51:10
Question: I wrote a small spider, and when I run it, it can't call the pipeline. After debugging for a while, I found the buggy code area. The logic of the spider is that I crawl the first URL to fetch a cookie, then crawl the second URL to download the code picture with that cookie, and post some data I prepared to the third URL. If the text I get from the picture is wrong, I download it again and post to the third URL repeatedly, until I get the right text. Let me show you the code:

```python
# -*- coding: gbk -*-
import scrapy
```
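
The question's code is truncated, but in exactly this retry-until-the-CAPTCHA-is-right pattern a frequent cause of a silent pipeline is Scrapy's duplicate filter dropping the repeated request for the code picture, so no callback ever yields an item. A hypothetical sketch (all URLs, field names, and the OCR helper are made up for illustration):

```python
import scrapy


def solve_captcha(image_bytes):
    # Hypothetical stand-in for whatever reads the code picture.
    return "1234"


class CaptchaSpider(scrapy.Spider):
    name = "captcha_retry"
    start_urls = ["http://example.com/login"]  # first URL: fetches the cookie

    def parse(self, response):
        # Second URL: the code picture; session cookies ride along automatically.
        yield scrapy.Request(
            "http://example.com/captcha.jpg",
            callback=self.parse_captcha,
            dont_filter=True,  # repeat requests would otherwise be deduplicated and dropped
        )

    def parse_captcha(self, response):
        text = solve_captcha(response.body)
        yield scrapy.FormRequest(
            "http://example.com/submit",  # third URL
            formdata={"captcha": text},
            callback=self.after_submit,
        )

    def after_submit(self, response):
        if "wrong captcha" in response.text:
            # Retry the picture; dont_filter=True keeps the dupefilter from eating it.
            yield scrapy.Request(
                "http://example.com/captcha.jpg",
                callback=self.parse_captcha,
                dont_filter=True,
            )
        else:
            # Items reach the pipeline only if a callback actually yields them.
            yield {"result": "ok"}
```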

Download all PDF files from crawled links

倾然丶 夕夏残阳落幕 submitted on 2020-01-16 08:27:33
Question: While running the code it says that ProductListPage is null, and after raising the error it does not proceed any further. Any ideas how to solve this issue? Wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or something else? Here is my current code:

```csharp
HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes(
    "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");
```
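
HtmlAgilityPack's `SelectNodes` returns null rather than an empty collection when the XPath matches nothing, so the result has to be null-checked; and if it is always null, the product list is most likely rendered by JavaScript and never present in the raw HTML that HtmlWeb downloads. A minimal sketch under the assumption that the links are in the static HTML and are absolute URLs:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PdfCrawler
{
    static void Main()
    {
        var htmlDoc = new HtmlWeb().Load("https://example.com/");
        var productLinks = htmlDoc.DocumentNode.SelectNodes(
            "//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a");

        // SelectNodes yields null, not an empty list, when nothing matches.
        if (productLinks == null)
        {
            Console.WriteLine("No product links found; the page may build them with JavaScript.");
            return;
        }

        using (var client = new WebClient())
        {
            foreach (var link in productLinks)
            {
                string href = link.GetAttributeValue("href", "");
                if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // File name taken from the URL; adjust as needed.
                    client.DownloadFile(href, Path.GetFileName(href));
                }
            }
        }
    }
}
```

If the nodes are generated client-side, no amount of waiting helps with HtmlAgilityPack alone; a browser-driven tool such as Selenium is the usual escape hatch.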

Identifying a Search Engine Crawler

和自甴很熟 submitted on 2020-01-16 04:48:07
Question: I am working on a website which loads its data via AJAX. I also want the whole website to be crawlable by search engines like Google and Yahoo. I want to make two versions of the site: [1] when a user comes, the hyperlinks should work just like Gmail's (#'ed hyperlinks); [2] when a crawler comes, the hyperlinks should work normally (AJAX mode off). How can I identify a crawler? Answer 1: You should not present a different form of your website to your users and a crawler. If Google discovers you
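
For the mechanical part of the question (the answer's warning about cloaking still stands), crawlers conventionally identify themselves in the User-Agent request header. A minimal sketch in Python; the bot list is illustrative, not exhaustive, and the header is trivially spoofable:

```python
# Illustrative substrings from well-known crawler User-Agent strings.
KNOWN_BOTS = ("googlebot", "yahoo! slurp", "bingbot", "baiduspider")


def is_crawler(user_agent):
    """Heuristically decide whether a request came from a search-engine bot."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_BOTS)


print(is_crawler(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # True
```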

jsoup to log in to a website

二次信任 submitted on 2020-01-16 01:19:12
Question: I am trying to use jsoup to get information after logging into "http://pawscas.usask.ca/cas-web/login". I've tried what's below and it doesn't seem to work; any help would be appreciated, thanks.

```java
Connection.Response res = null;
try {
    res = Jsoup.connect("http://pawscas.usask.ca/cas-web/login")
            .data("username", "user")
            .data("password", "pass")
            //.data("It", "some data")
            //.data("execution", "some data")
            //.data("_eventId", "submit")
            .method(Method.POST)
            .execute();
} catch (IOException e) {
```
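
The commented-out fields point at the usual culprit: a CAS login form carries hidden one-time tokens, "lt" (lowercase L, likely what the "It" line was meant to be), "execution", and "_eventId", which must be scraped from a fresh GET of the login page and posted back along with that request's cookies. A sketch under that assumption (the input names follow standard CAS deployments and are not verified against this site):

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CasLogin {
    public static void main(String[] args) throws IOException {
        String loginUrl = "http://pawscas.usask.ca/cas-web/login";

        // Step 1: GET the login page for its session cookies and hidden tokens.
        Connection.Response loginPage = Jsoup.connect(loginUrl)
                .method(Connection.Method.GET)
                .execute();
        Document doc = loginPage.parse();
        String lt = doc.select("input[name=lt]").val();
        String execution = doc.select("input[name=execution]").val();

        // Step 2: POST credentials plus the tokens, reusing the same cookies.
        Connection.Response res = Jsoup.connect(loginUrl)
                .cookies(loginPage.cookies())
                .data("username", "user")
                .data("password", "pass")
                .data("lt", lt)
                .data("execution", execution)
                .data("_eventId", "submit")
                .method(Connection.Method.POST)
                .execute();

        System.out.println("Status after login: " + res.statusCode());
    }
}
```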

Date format getting disturbed when creating a .CSV file in Java

偶尔善良 submitted on 2020-01-16 01:10:30
Question: I am creating a web scraper and then storing the data in a .CSV file. My program runs fine, but there is a problem: the website I am retrieving the data from has a date in (Month Day, Year) format, so when I save the data in the .CSV file it treats the Year as another column, because of which all the data gets shifted. I actually want to store that date as (MM-MON-YYYY) and keep the validity date in one column. I am posting my code below. Kindly help me out. Thanks! P
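
The root cause is the comma inside "Month Day, Year": an unquoted comma starts a new CSV column. One fix is to reformat the date into a comma-free form before writing it; a minimal sketch, with the sample input string assumed from the question's description:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateReformat {
    public static void main(String[] args) {
        String scraped = "January 16, 2020"; // assumed shape of the scraped date

        DateTimeFormatter in = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);
        DateTimeFormatter out = DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

        LocalDate date = LocalDate.parse(scraped, in);
        // 16-Jan-2020: no comma, so the whole date stays in one CSV column.
        System.out.println(date.format(out));
    }
}
```

Alternatively, wrapping the field in double quotes ("January 16, 2020") keeps the original text in a single column, since commas inside quoted CSV fields are not treated as separators.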

Selenium with PhantomJS: Form being validated but not submitted

别说谁变了你拦得住时间么 submitted on 2020-01-15 18:50:15
Question: I'm having a strange problem submitting a form through Selenium WebDriver's PhantomJS API. Upon clicking the submit button, the form gets validated (are the username and password too short, or blank, etc.), but it does not get ultimately submitted. That is, if I submit an invalid form and check the screenshot, there are alert notifications. If I submit a valid form, nothing happens. The JS on the page is supposed to validate the form, then submit it, when the submit button is clicked. A
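
The question is cut off, but a common workaround when PhantomJS runs the validation handler yet the submission itself never fires is to submit the form element directly after the click. A sketch in Python, with the URL and element locators as placeholders (and noting that PhantomJS is long deprecated; headless Chrome or Firefox reproduces real-browser behavior far more faithfully):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.PhantomJS()  # Selenium 3-era API; removed in Selenium 4
driver.get("http://example.com/login")  # placeholder URL

driver.find_element(By.NAME, "username").send_keys("user")  # placeholder locators
driver.find_element(By.NAME, "password").send_keys("pass")

# Click first so the page's validation JS runs, then fall back to
# submitting the form element directly in case the handler validated
# the fields but never performed the actual submission.
driver.find_element(By.ID, "submit-button").click()
driver.find_element(By.TAG_NAME, "form").submit()

driver.save_screenshot("after_submit.png")  # screenshots are how the question debugs
driver.quit()
```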