web-crawler

Scrapy get all links from any website

假装没事ソ Submitted on 2020-04-10 03:35:38

Question: I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
    recursive_search(links)

recursive …
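For reference, here is a minimal, self-contained sketch of the same idea with the obvious gaps filled in: get_links returns its list, a visited set keeps the crawler from revisiting pages, and the recursion is bounded by a depth limit. The depth limit and the example start URL are assumptions for illustration, not part of the question.

import re
import requests
from bs4 import BeautifulSoup

def get_links(url):
    # Fetch one page and return every absolute http(s) link found on it.
    links = []
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        return links
    if r.status_code != 200:
        return links
    soup = BeautifulSoup(r.content, "lxml")
    for a in soup.find_all('a', attrs={'href': re.compile("^http")}):
        links.append(a.get('href'))
    return links

def crawl(url, visited=None, depth=2):
    # Depth-limited recursive crawl that skips already-seen URLs.
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return visited
    visited.add(url)
    for link in get_links(url):
        crawl(link, visited, depth - 1)
    return visited

if __name__ == "__main__":
    print(crawl("https://example.com"))  # example start URL, purely illustrative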

How to add proxies to BeautifulSoup crawler

痴心易碎 Submitted on 2020-03-17 12:07:52

Question: These are the imports in the Python crawler:

from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime

How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working on BeautifulSoup? I know how to add proxies if I were using Mechanize's browser:

br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port', 'https': …
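A rotating proxy does not require Mechanize: urllib2's ProxyHandler can be installed per request, and the eventlet.green.urllib2 module imported above exposes the same API. Below is a minimal sketch under that assumption; the proxy URLs and credentials are placeholders, and round-robin rotation per fetch is my reading of the desired policy, not something stated in the question.

from itertools import cycle
from eventlet.green import urllib2  # same API as urllib2, but cooperative

# Placeholder proxies -- replace with real ones.
PROXIES = cycle([
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
])

def fetch(url):
    # Build an opener around the next proxy in the rotation for this request.
    proxy = next(PROXIES)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy, 'https': proxy}))
    return opener.open(url, timeout=30).read()

Each green thread that calls fetch() simply picks up the next proxy in the cycle, so no per-thread state is needed.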

Getting TCP connection timed out: 110: Connection timed out. on AWS while using scrapy?

☆樱花仙子☆ Submitted on 2020-03-01 20:42:13

Question: This is my Scrapy code.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
import pymongo
import time

class CompItem(scrapy.Item):
    text = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    title = scrapy.Field()
    category = scrapy.Field()
    source = scrapy.Field()
    user_info = scrapy.Field()
    email = …
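The excerpt is cut off before the traceback, but "TCP connection timed out: 110" from an EC2 host usually means the target is slow or unreachable from that network (some sites throttle or block AWS IP ranges outright). A common first step is to make the spider more tolerant; the sketch below is a starting point built on that assumption, with illustrative values rather than known-good ones.

import scrapy

class CompSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, purely for illustration.
    name = "comp"
    start_urls = ['http://example.com/']

    custom_settings = {
        'DOWNLOAD_TIMEOUT': 60,        # fail faster instead of hanging on dead connections
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,              # retry timed-out requests a few more times
        'DOWNLOAD_DELAY': 1.0,         # slow down in case the site throttles the AWS IP range
        'CONCURRENT_REQUESTS': 8,
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
    }

    def parse(self, response):
        self.logger.info("fetched %s", response.url)

If the site blocks AWS addresses entirely, no amount of retrying helps and the crawl has to go through a proxy or a different network.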

How to set different IP according to different commands of one single scrapy.Spider?

▼魔方 西西 Submitted on 2020-02-25 08:08:06

Question: I have a bunch of pages to scrape, about 200,000. I usually use Tor and the Polipo proxy to hide my spiders' behaviour; even if they are polite, you never know. If I log in, it is pointless to use a single account while changing IP, so I create several accounts on the website and start my spider with arguments, as in the following:

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None):
        self …
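One way to pair each account with its own exit point is to pass the proxy in as a spider argument alongside the credentials and attach it to every request via request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The sketch below is built on that assumption; the argument names and login form fields are illustrative, not taken from the question.

import scrapy

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None, proxy=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.username = username
        self.password = password
        self.proxy = proxy  # e.g. 'http://127.0.0.1:8123' for a local Polipo instance

    def start_requests(self):
        for url in self.start_urls:
            # Every request from this spider instance goes through its own proxy.
            yield scrapy.Request(url, meta={'proxy': self.proxy} if self.proxy else {})

    def parse(self, response):
        # Illustrative login step; the field names depend on the real form.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': self.username, 'password': self.password},
            meta=response.meta,
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("logged in as %s via %s", self.username, self.proxy)

Launched as, say, scrapy crawl spider -a username=u1 -a password=p1 -a proxy=http://127.0.0.1:8123, each run then sticks to one account and one exit IP.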

Crawling IMDB for movie trailers?

前提是你 Submitted on 2020-02-06 08:24:45

Question: I want to crawl IMDB and download the trailers of movies (either from YouTube or IMDB) that fit some criteria (e.g. released this year, with a rating above 2). I want to do this in Python - I saw that there are packages for crawling IMDB and for downloading YouTube videos. My current plan is to crawl IMDB, then search YouTube for '$movie_name' + 'trailer' and hope that the top result is the trailer, then download it. Still, this seems a bit convoluted and I was wondering if …
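A rough sketch of that plan using two commonly used packages: IMDbPY (the imdb package) for the metadata and yt-dlp for the download. Both packages, the top-250 list used as the source of movies, and the year/rating values are assumptions for illustration, and the "top YouTube result is the trailer" heuristic stays as fragile as the asker suspects.

from imdb import IMDb           # pip install IMDbPY
from yt_dlp import YoutubeDL    # pip install yt-dlp

def download_trailer(movie_title, out_dir="trailers"):
    # Grab the first YouTube search result for "<title> trailer".
    opts = {
        'outtmpl': f'{out_dir}/%(title)s.%(ext)s',
        'noplaylist': True,
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([f'ytsearch1:{movie_title} trailer'])

def main(min_rating=2.0, year=2020):
    # Illustrative filter: "released this year, rating above 2".
    ia = IMDb()
    for movie in ia.get_top250_movies():  # just one convenient list; any IMDb query would do
        ia.update(movie)                  # fetch full data, including year and rating
        if movie.get('year') == year and (movie.get('rating') or 0) > min_rating:
            download_trailer(movie['title'])

if __name__ == "__main__":
    main()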

How to limit scrapy request objects?

社会主义新天地 Submitted on 2020-01-31 20:04:52

Question: So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes upwards of 100,000) when I check the telnet console:

>>> prefs()

I have been over the docs and Google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking requests once a certain amount goes into the scheduler. I have tried setting a DEPTH_LIMIT but that only …
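One straightforward option, sketched below as an assumption about what "hold back" should mean, is to count scheduled requests in the spider itself and stop yielding new ones past a cap; switching Scrapy to breadth-first order (DEPTH_PRIORITY plus FIFO queues) also keeps the scheduler from ballooning on link-rich pages.

import scrapy

class CappedSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "capped"
    start_urls = ['http://example.com/']
    max_requests = 10000  # illustrative cap on how many requests ever get scheduled

    custom_settings = {
        # Breadth-first order keeps fewer deep branches queued at once.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.scheduled = 0

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            if self.scheduled >= self.max_requests:
                return  # cap reached: stop feeding the scheduler
            self.scheduled += 1
            yield response.follow(href, callback=self.parse)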

Nutch - does not crawl, says “Stopping at depth=1 - no more URLs to fetch”

夙愿已清 Submitted on 2020-01-25 23:49:30

Question: I have been trying for a long time to crawl using Nutch, but it just doesn't seem to run. I'm trying to build a Solr search for a website, using Nutch for crawling and indexing into Solr. There were some permission problems originally, but they have been fixed now. The URL I'm trying to crawl is http://172.30.162.202:10200/ , which is not publicly accessible. It is an internal URL that can be reached from the Solr server. I tried browsing it using Lynx. Given below is the output from …
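The excerpt is cut off before the Lynx output, but one check worth scripting from the crawl host is whether the seed URL answers at all and what its robots.txt says, since Nutch honours robots rules and a blanket Disallow would produce exactly this "no more URLs to fetch" behaviour. A small diagnostic sketch, assuming it is run from the same host as the crawler; the seed URL is the one from the question.

from urllib.request import urlopen
from urllib.error import URLError

SEED = "http://172.30.162.202:10200/"

def probe(url):
    # Show the HTTP status and the first few hundred bytes, from this host's point of view.
    try:
        with urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)
            print(resp.read(300).decode("utf-8", errors="replace"))
    except URLError as exc:
        print(url, "->", exc)

if __name__ == "__main__":
    probe(SEED)                  # does the seed respond at all from here?
    probe(SEED + "robots.txt")   # a "Disallow: /" here would explain an empty fetch list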

Invalid URLs throw an exception - python

只愿长相守 Submitted on 2020-01-25 19:21:28

Question:

import httplib
import urlparse

def getUrl(url):
    try:
        parts = urlparse.urlsplit(url)
        server = parts[1]
        path = parts[2]
        obj = httplib.HTTPConnection(server, 80)
        obj.connect()
        obj.putrequest('HEAD', path)
        obj.putheader('Accept', '*/*')
        obj.endheaders()
        response = obj.getresponse()
        contentType = response.getheader("content-type", "unknown")
        obj.close()
        if response.status != 200:
            print 'Error'
        else:
            print 'Awesome'
    except Exception, e:
        print e

I wrote the code above to check if a given URL is valid or …
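The excerpt stops before the actual question, but a typical improvement on this pattern is to return a boolean instead of printing, catch only the network-related exceptions, and handle URLs that have no host at all. Below is a Python 3 sketch along those lines, using http.client and urllib.parse (the Python 3 names for httplib and urlparse); the 10-second timeout is an arbitrary choice.

import http.client
import urllib.parse

def url_is_valid(url, timeout=10):
    # HEAD the URL and report whether the server answered with 200 OK.
    parts = urllib.parse.urlsplit(url)
    if not parts.netloc:
        return False  # no host at all, e.g. a relative path or an empty string
    conn_cls = http.client.HTTPSConnection if parts.scheme == 'https' else http.client.HTTPConnection
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        conn.request('HEAD', parts.path or '/')
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except (http.client.HTTPException, OSError):
        # Covers unknown hosts, refused connections, timeouts, malformed responses.
        return False

if __name__ == "__main__":
    print(url_is_valid("https://example.com/"))  # True if the host is reachable
    print(url_is_valid("not a url"))             # False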