screen-scraping

Trouble Scraping Web Page With Malformed Content

若如初见. Submitted on 2020-01-25 23:13:51
Question: I have written C# code which uses the HtmlAgilityPack library to scrape a page located at: World's Largest Urban Areas (Page 2). Unfortunately the page consists of malformed content, and I'm at an impasse on how to scrape it. The code I have so far (below) freezes while parsing the HTML: HtmlNodeCollection cityRecords = _htmlDocument.DocumentNode.SelectNodes("//table[@class='boldtable']//tr[position() != 1]"); CityNodes = (from node in cityRecords.Descendants() where …
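
The preview is cut off mid-query. As a rough cross-check of the same XPath outside C#, here is a minimal sketch using Python's lxml, whose HTML parser also tolerates malformed markup (the URL is a placeholder and the cell handling is an assumption):

    import requests
    from lxml import html

    # Placeholder for the "World's Largest Urban Areas (Page 2)" URL in the question
    page = requests.get("https://example.com/largest-urban-areas-page-2.html")

    # lxml's HTML parser recovers from malformed markup instead of hanging on it
    tree = html.fromstring(page.content)

    # Same XPath as the question: every row of the bold table except the header row
    for row in tree.xpath("//table[@class='boldtable']//tr[position() != 1]"):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        print(cells)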

Using curl to get from one webpage to another involving JavaScript

血红的双手. Submitted on 2020-01-25 10:01:19
Question: I have webpage1.html which has a hyperlink whose href="some/javascript/function/outputLink()". Now, using curl (or any other method in PHP), how do I deduce the hyperlink (in http:// form) from the JavaScript function so that I can go to the next page? Thanks. Answer 1: You'd have to scrape the JavaScript: figure out where the function is defined and see what URL it uses. Sometimes http:// is omitted for links on the same site, so that won't be a reliable search reference. At this point the only …
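
The answer is cut off above. As a rough illustration of "scrape the JavaScript", here is a sketch in Python rather than curl/PHP; the function name outputLink comes from the question, while the URL and the regular expressions are assumptions:

    import re
    import requests

    # Placeholder URL for webpage1.html from the question
    source = requests.get("https://example.com/webpage1.html").text

    # Find the body of the outputLink() function, then pull quoted URLs/paths out of it
    body = re.search(r"function\s+outputLink\s*\([^)]*\)\s*\{(.*?)\}", source, re.S)
    if body:
        targets = re.findall(r"""["'](https?://[^"']+|/[^"']+)["']""", body.group(1))
        print(targets)  # candidates for the next curl / requests call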

Python lxml - returns null list

孤街浪徒 Submitted on 2020-01-25 05:58:02
Question: I cannot figure out what is wrong with the XPath when trying to extract a value from a table on a web page. The approach seems correct, since I can extract the page title and other attributes, but the third value always comes back as an empty list. from lxml import html import requests test_url = 'SC312226' page = ('https://www.opencompany.co.uk/company/'+test_url) print 'Now searching URL: '+page data = requests.get(page) tree = html.fromstring(data.text) print tree.xpath('//title/text()' …
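
The snippet is cut off mid-call. Here is a runnable Python 3 version of what the question appears to be doing, with an assumed table dump added at the end (the original table XPath is truncated, so the exact expression is unknown):

    import requests
    from lxml import html

    test_url = 'SC312226'
    page = 'https://www.opencompany.co.uk/company/' + test_url
    print('Now searching URL: ' + page)

    data = requests.get(page)
    tree = html.fromstring(data.text)

    # The title extracts fine, as reported in the question
    print(tree.xpath('//title/text()'))

    # Dump every table row so the served markup can be compared against the XPath;
    # an empty list usually means the expression does not match what the server
    # actually returns (e.g. the value is filled in later by JavaScript).
    for row in tree.xpath('//table//tr'):
        print([cell.text_content().strip() for cell in row.xpath('./td')])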

Screen Scraping

岁酱吖の Submitted on 2020-01-24 14:12:12
Question: Hi, I'm trying to implement a screen-scraping scenario on my website and have the following set up so far. What I'm ultimately trying to do is replace all links in the $results variable that contain "ResultsDetails.aspx?" with "results-scrape-details/" and then output the result again. Can anyone point me in the right direction? <?php $url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx"; $raw = file_get_contents($url); $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B"); $content = str_replace( …
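
The PHP is cut off at the str_replace call. As a language-agnostic sketch of the same idea in Python (the whitespace stripping and the link rewrite are assumptions about where the snippet was heading):

    import re
    import urllib.request

    # URL copied from the question; everything below it is an assumption
    url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx"
    raw = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

    # Strip the same control/whitespace characters the PHP $newlines array lists
    content = re.sub(r"[\t\n\r\0\x0b]|\x20\x20", "", raw)

    # Rewrite the link prefix and output the result again
    content = content.replace("ResultsDetails.aspx?", "results-scrape-details/")
    print(content)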

Python web scraping: difference between sleep and request(page, timeout=x)

旧时模样 Submitted on 2020-01-23 03:59:07
Question: When scraping multiple websites in a loop, I notice a rather large difference in speed between sleep(10); response = requests.get(url) and response = requests.get(url, timeout=10). That is, the timeout version is much faster. Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case. Why is there such a difference in speed? Why is the scraping duration per page less than 10 seconds? I now use …
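
The question is truncated before any answer appears. The short explanation, sketched below, is that timeout= only caps how long requests will wait for the server to respond, whereas sleep() is an unconditional pause, so only the latter guarantees a minimum delay per page (URLs are placeholders):

    import time
    import requests

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholders

    for url in urls:
        # timeout=10 does NOT wait 10 seconds: it only raises an exception if the
        # server takes longer than that to respond, so a fast page returns almost
        # immediately.
        response = requests.get(url, timeout=10)

        # An explicit sleep is what enforces a minimum delay between page requests.
        time.sleep(10)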

C# WebClient - View source question

生来就可爱ヽ(ⅴ<●) Submitted on 2020-01-19 13:13:30
Question: I'm using a C# WebClient to post login details to a page and read all the results. The page I am trying to load includes Flash (which, in the browser, translates into HTML). I'm guessing it's Flash to avoid being picked up by search engines? The Flash I am interested in is just text (not an image/video), and when I use "View Selection Source" in Firefox I do actually see the text, within HTML, that I want. (Interestingly, when I view the source for the whole page I do not see the …
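
The question is cut off mid-sentence. For the login-and-read part (not the Flash rendering, which no plain HTTP client will execute), here is a sketch using a cookie-aware session in Python rather than WebClient; all URLs and form field names are assumptions:

    import requests

    # Placeholder URLs and form field names; the real ones depend on the site
    session = requests.Session()
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # The text returned here is what "view source" shows; content the browser
    # builds afterwards (Flash, JavaScript) will not appear in it.
    page = session.get("https://example.com/members/results")
    print(page.text)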

Screen scraping in PHP problem

孤街浪徒 Submitted on 2020-01-17 05:19:20
Question: I have made a screen-scraping module which works fine, but with certain limitations. Now I want to remove those limitations, but I get unpredictable and varying errors. Before anything else, let me explain what is actually happening. Initially I used screen scraping to retrieve results for a set of keywords (search content) from Google's country-specific search engines, like co.in/co.uk/nl/de/com. But now I have to run the scraping logic for multiple search engines and multiple keywords in a loop. Let's check out this …
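
The question is cut off before its code. As a sketch of the nested loop it describes, in Python (the domains, keywords, and delay are assumptions); note that rapid automated queries to search engines are commonly blocked, which can surface as exactly this kind of unpredictable error:

    import time
    import requests

    # Placeholder domains and keywords based on the question's description
    domains = ["google.co.in", "google.co.uk", "google.nl", "google.de", "google.com"]
    keywords = ["keyword one", "keyword two"]

    for domain in domains:
        for keyword in keywords:
            response = requests.get(f"https://www.{domain}/search",
                                    params={"q": keyword},
                                    headers={"User-Agent": "Mozilla/5.0"},
                                    timeout=10)
            print(domain, keyword, response.status_code)
            time.sleep(5)  # pause between requests; rapid loops are often blocked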