web-crawler

How can I crawl web data that is not in tags

Submitted by Deadly on 2020-01-24 21:30:06
Question: Given this markup:

    <div id="main-content" class="content">
      <div class="metaline">
        <span class="article-meta author">jorden</span>
      </div>
      " 1.name:jorden 2.age:28 -- "
      <span class="D2"> from 111.111.111.111 </span>
    </div>

I only need "1.name:jorden 2.age:28". Calling xxx.select('#main-content') returns everything, but I only need part of it. Because the text I want is not inside any tags, I don't know how to get at it.

Answer 1: You want to find the tag before the text in question (in your case, <div class="metaline">) and then look at …
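A minimal sketch of that approach with BeautifulSoup, assuming the markup shown above (variable names are illustrative): find the <div class="metaline"> element, then read its next sibling, because bare text between tags appears in the parse tree as a text-node sibling of the preceding tag.

    from bs4 import BeautifulSoup

    html = '''<div id="main-content" class="content">
      <div class="metaline">
        <span class="article-meta author">jorden</span>
      </div>
      " 1.name:jorden 2.age:28 -- "
      <span class="D2"> from 111.111.111.111 </span>
    </div>'''

    soup = BeautifulSoup(html, 'html.parser')
    metaline = soup.find('div', class_='metaline')
    # Bare text that follows a closing tag is that tag's next sibling.
    loose_text = metaline.next_sibling.strip()
    print(loose_text)  # " 1.name:jorden 2.age:28 -- "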

scrapy spider not returning any results

Submitted by 好久不见. on 2020-01-24 19:12:32
Question: This is my first attempt to create a spider, so kindly spare me if I have not done it properly. Here is the link to the website I am trying to extract data from: http://www.4icu.org/in/. I want the entire list of colleges displayed on the page, but when I run the following spider I get back an empty JSON file.

My items.py:

    import scrapy

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        link = scrapy.Field()

This is the spider, colleges.py:

    import …
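The post is cut off before the spider code, so here is only a hedged sketch of what a minimal working spider for a listing page like this could look like (the spider name, selector, and output field are assumptions, not the original poster's code). An empty JSON file usually means the parse callback never yields anything, so the selector is the first thing to check.

    import scrapy

    class CollegesSpider(scrapy.Spider):
        name = 'colleges'
        start_urls = ['http://www.4icu.org/in/']

        def parse(self, response):
            # Yield one item per anchor found on the listing page.
            for href in response.css('a::attr(href)').extract():
                yield {'link': response.urljoin(href)}

Run it with: scrapy crawl colleges -o colleges.json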

running multiple threads in python, simultaneously - is it possible?

Submitted by 做~自己de王妃 on 2020-01-24 06:25:26
Question: I'm writing a little crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously). I've written a little piece of code that should do that:

    import thread
    from urllib2 import Request, urlopen, URLError, HTTPError

    def getPAGE(FetchAddress):
        attempts = 0
        while attempts < 2:
            req = Request(FetchAddress, None)
            try:
                response = urlopen(req, timeout=8)  # fetching the url
                print "fetched url %s" % FetchAddress
            except HTTPError, e:
                print 'The server …
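The snippet above is Python 2 (thread, urllib2, print statements). A sketch of the same idea in Python 3, where a thread pool runs the fetches simultaneously (the URL and worker count are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import Request, urlopen

    def get_page(fetch_address):
        for attempt in range(2):
            try:
                response = urlopen(Request(fetch_address), timeout=8)
                print('fetched url %s' % fetch_address)
                return response.read()
            except OSError as e:  # URLError and timeouts derive from OSError
                print('attempt %d failed: %s' % (attempt + 1, e))
        return None

    # Five threads fetch the same URL at the same time.
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(get_page, ['http://example.com'] * 5))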

how to get content from an external web page with PHP?

Submitted by 我的未来我决定 on 2020-01-24 00:53:05
Question: I want to get the 'title', 'description', and 'keywords' from a web page. I know 3 ways to do this job: a) use cURL, b) use fopen, c) use get_meta_tags(). Strangely, none of the above works correctly every time. For the same URL, sometimes I can get the content, and sometimes it returns an error: 'failed to open stream: HTTP request failed'. I'm confused. Why? Help me :)

Answer 1: You can use file_get_contents("http://someurl.com"); to fetch an external website. The result will be a string containing the …
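The intermittent 'failed to open stream' error is often a timeout or a server rejecting requests that lack a User-Agent header; that is an assumption about this case, not something the thread confirms. For comparison, the same task sketched in Python with a User-Agent and a simple retry (the regex-based meta parsing is rough, not production-grade):

    import re
    from urllib.request import Request, urlopen

    def fetch_meta(url, retries=2):
        for attempt in range(retries):
            try:
                req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
                html = urlopen(req, timeout=10).read().decode('utf-8', 'replace')
                title = re.search(r'<title>(.*?)</title>', html, re.I | re.S)
                metas = dict(re.findall(
                    r'<meta\s+name="(description|keywords)"\s+content="([^"]*)"',
                    html, re.I))
                return (title.group(1).strip() if title else None), metas
            except OSError as e:  # network errors and timeouts
                print('attempt %d failed: %s' % (attempt + 1, e))
        return None, {}

    print(fetch_meta('http://example.com'))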

Crawl website using wget and limit total number of crawled links

Submitted by 佐手、 on 2020-01-23 11:14:27
Question: I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

    wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Answer 1: You can't. wget doesn't support this, so if you want something like this you would have to write a tool yourself. You could fetch the …
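Since wget can't stop after a fixed number of links, a tiny custom fetcher is the alternative the answer points at. A sketch in Python that downloads the start page and keeps the first 100 links found (single page only; recursing into those links is left out for brevity):

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self, base, limit=100):
            super().__init__()
            self.base, self.limit, self.links = base, limit, []

        def handle_starttag(self, tag, attrs):
            if tag == 'a' and len(self.links) < self.limit:
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(urljoin(self.base, value))

    url = 'http://www.example.com'
    page = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
    collector = LinkCollector(url)
    collector.feed(page)
    print('\n'.join(collector.links))  # at most the first 100 links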

Python web crawler sometimes returns half of the source code, sometimes all of it… From the same website

Submitted by 爱⌒轻易说出口 on 2020-01-17 13:45:55
Question: I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I've been stuck on all day. When I go to the USPTO site and get the source code, it will sometimes give me the whole thing and work wonderfully, but other times it only gives me about the second half (and what I'm looking for is in the first). I've searched around here quite a bit, and I haven't seen anyone with this …
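The entry is cut off before any answer, but one plausible cause (an assumption, not confirmed by the thread) is the connection closing before the full body arrives. A defensive sketch: re-request until the document looks complete.

    import time
    from urllib.request import Request, urlopen

    def fetch_complete(url, attempts=3):
        html = ''
        for _ in range(attempts):
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            html = urlopen(req, timeout=20).read().decode('utf-8', 'replace')
            # Heuristic: a truncated response usually lacks the closing tag.
            if '</html>' in html.lower():
                return html
            time.sleep(2)  # back off before retrying
        return html  # best effort after all attempts

    page = fetch_complete('https://www.uspto.gov/')  # placeholder URL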

php crawler for website with ajax content and https

Submitted by 江枫思渺然 on 2020-01-17 05:18:35
Question: I'm trying to grab the content of a website based on Ajax and HTTPS, but with no luck. Is this possible? The website I'm trying to crawl is this: https://www.bet3000.com/en/html/home.html#!https://www.bet3000.com/html/en/eventssportsbook.html?category_id=2117 Thanks.

Answer 1: If you take a look at the HTTP requests that this page makes (using, for example, Firebug for Firefox), you'll notice it makes several Ajax requests. Instead of trying to execute the JavaScript code, a possible solution …
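Following the answer's line of thought: the page's own URL fragment (after the #!) names the document it loads via Ajax, so that document can be requested directly instead of executing any JavaScript. A sketch (the headers, and whether the endpoint still responds, are assumptions):

    import urllib.request

    # The content document named in the page's #! fragment.
    ajax_url = ('https://www.bet3000.com/html/en/'
                'eventssportsbook.html?category_id=2117')
    req = urllib.request.Request(ajax_url,
                                 headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req, timeout=10).read()
    print(html[:500])  # inspect the start of the fetched markup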

How to make dynamic links crawlable by Google

Submitted by 北城余情 on 2020-01-17 04:08:23
Question: I have a question/answer website where each question has a link. My problem is: how do I feed these links to Google? Should I put them in sitemap.xml or robots.txt? What is the standard solution to this problem? Thanks, Amit Aggarwal

Answer 1: Some advice: first, make sure your website is SEO-friendly and crawlable by search engines; second, make sure to publish your sitemap to Google. To do that, add your site to Google Webmaster Tools and submit your sitemap (XML, RSS, or ATOM feed formats).
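A sketch of the sitemap side of that advice in Python: generate a minimal sitemap.xml listing the dynamic question URLs, then submit it in Google Webmaster Tools (the URLs below are placeholders):

    from xml.sax.saxutils import escape

    question_urls = [
        'http://example.com/questions/1',
        'http://example.com/questions/2',
    ]

    entries = '\n'.join(
        '  <url><loc>%s</loc></url>' % escape(u) for u in question_urls)
    sitemap = ('<?xml version="1.0" encoding="UTF-8"?>\n'
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
               '%s\n'
               '</urlset>' % entries)

    with open('sitemap.xml', 'w', encoding='utf-8') as f:
        f.write(sitemap)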

php crawler for wiki getting error [closed]

Submitted by 安稳与你 on 2020-01-17 04:06:27
Question (closed as needing details or clarity): In the code below I am trying to extract content from a website using PHP. It works fine when I use

    getElementByIdAsString('www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'synopsis');

but it is not working when I use the same code to extract content …
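getElementByIdAsString looks like a helper from the poster's own code rather than a PHP builtin, and the question is cut off before the failing case. The underlying task, fetching a page and extracting one element by its id, looks like this in Python (URL taken from the question; 'synopsis' is the target id; treat the User-Agent header as an assumption):

    import urllib.request
    from bs4 import BeautifulSoup

    url = ('https://www.abebooks.com/9780143418764/'
           'Love-Story-Singh-Ravinder-0143418769/plp')
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req, timeout=10).read()

    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find(id='synopsis')  # the element the question targets
    print(element.get_text(strip=True) if element else 'no #synopsis element')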