web-crawler

Scrapy get all links from any website

假装没事ソ Submitted on 2020-04-10 03:35:38

Question: I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
    recursive_search(links)

recursive …
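For reference, here is a minimal, self-contained sketch of the same idea with the obvious gaps filled in: get_links returns its list, a visited set keeps the crawler from revisiting pages, and the recursion is bounded by a depth limit. The depth limit and the example start URL are assumptions for illustration, not part of the question.

import re
import requests
from bs4 import BeautifulSoup

def get_links(url):
    # Fetch one page and return every absolute http(s) link found on it.
    links = []
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        return links
    if r.status_code != 200:
        return links
    soup = BeautifulSoup(r.content, "lxml")
    for a in soup.find_all('a', attrs={'href': re.compile("^http")}):
        links.append(a.get('href'))
    return links

def crawl(url, visited=None, depth=2):
    # Depth-limited recursive crawl that skips already-seen URLs.
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return visited
    visited.add(url)
    for link in get_links(url):
        crawl(link, visited, depth - 1)
    return visited

if __name__ == "__main__":
    print(crawl("https://example.com"))  # example start URL, purely illustrative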

How to add proxies to BeautifulSoup crawler

痴心易碎 Submitted on 2020-03-17 12:07:52

Question: These are the imports in the Python crawler:

from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime

How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working on BeautifulSoup? I know how to add proxies if I were using Mechanize's browser:

br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port', 'https': …
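A rotating proxy does not require Mechanize: urllib2's ProxyHandler can be installed per request, and the eventlet.green.urllib2 module imported above exposes the same API. Below is a minimal sketch under that assumption; the proxy URLs and credentials are placeholders, and round-robin rotation per fetch is my reading of the desired policy, not something stated in the question.

from itertools import cycle
from eventlet.green import urllib2  # same API as urllib2, but cooperative

# Placeholder proxies -- replace with real ones.
PROXIES = cycle([
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
])

def fetch(url):
    # Build an opener around the next proxy in the rotation for this request.
    proxy = next(PROXIES)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy, 'https': proxy}))
    return opener.open(url, timeout=30).read()

Each green thread that calls fetch() simply picks up the next proxy in the cycle, so no per-thread state is needed.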

Getting TCP connection timed out: 110: Connection timed out. on AWS while using scrapy?

☆樱花仙子☆ Submitted on 2020-03-01 20:42:13

Question: This is my Scrapy code.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
import pymongo
import time

class CompItem(scrapy.Item):
    text = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    title = scrapy.Field()
    category = scrapy.Field()
    source = scrapy.Field()
    user_info = scrapy.Field()
    email = …
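The excerpt is cut off before the traceback, but "TCP connection timed out: 110" from an EC2 host usually means the target is slow or unreachable from that network (some sites throttle or block AWS IP ranges outright). A common first step is to make the spider more tolerant; the sketch below is a starting point built on that assumption, with illustrative values rather than known-good ones.

import scrapy

class CompSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, purely for illustration.
    name = "comp"
    start_urls = ['http://example.com/']

    custom_settings = {
        'DOWNLOAD_TIMEOUT': 60,        # fail faster instead of hanging on dead connections
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,              # retry timed-out requests a few more times
        'DOWNLOAD_DELAY': 1.0,         # slow down in case the site throttles the AWS IP range
        'CONCURRENT_REQUESTS': 8,
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
    }

    def parse(self, response):
        self.logger.info("fetched %s", response.url)

If the site blocks AWS addresses entirely, no amount of retrying helps and the crawl has to go through a proxy or a different network.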

How to set different IP according to different commands of one single scrapy.Spider?

▼魔方 西西 Submitted on 2020-02-25 08:08:06

Question: I have a bunch of pages to scrape, about 200,000. I usually use Tor and the Polipo proxy to hide my spiders' behaviour; even if they are polite, you never know. If I log in, it is pointless to use a single account while changing IP, so I create several accounts on the website and start my spider with arguments, as in the following:

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None):
        self …
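One way to pair each account with its own exit point is to pass the proxy in as a spider argument alongside the credentials and attach it to every request via request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The sketch below is built on that assumption; the argument names and login form fields are illustrative, not taken from the question.

import scrapy

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None, proxy=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.username = username
        self.password = password
        self.proxy = proxy  # e.g. 'http://127.0.0.1:8123' for a local Polipo instance

    def start_requests(self):
        for url in self.start_urls:
            # Every request from this spider instance goes through its own proxy.
            yield scrapy.Request(url, meta={'proxy': self.proxy} if self.proxy else {})

    def parse(self, response):
        # Illustrative login step; the field names depend on the real form.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': self.username, 'password': self.password},
            meta=response.meta,
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("logged in as %s via %s", self.username, self.proxy)

Launched as, say, scrapy crawl spider -a username=u1 -a password=p1 -a proxy=http://127.0.0.1:8123, each run then sticks to one account and one exit IP.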

Crawling IMDB for movie trailers?

前提是你 Submitted on 2020-02-06 08:24:45

Question: I want to crawl IMDB and download the trailers of movies (either from YouTube or IMDB) that fit some criteria (e.g. released this year, with a rating above 2). I want to do this in Python - I saw that there are packages for crawling IMDB and for downloading YouTube videos. My current plan is to crawl IMDB, then search YouTube for '$movie_name' + 'trailer' and hope that the top result is the trailer, then download it. Still, this seems a bit convoluted and I was wondering if …
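A rough sketch of that plan using two commonly used packages: IMDbPY (the imdb package) for the metadata and yt-dlp for the download. Both packages, the top-250 list used as the source of movies, and the year/rating values are assumptions for illustration, and the "top YouTube result is the trailer" heuristic stays as fragile as the asker suspects.

from imdb import IMDb           # pip install IMDbPY
from yt_dlp import YoutubeDL    # pip install yt-dlp

def download_trailer(movie_title, out_dir="trailers"):
    # Grab the first YouTube search result for "<title> trailer".
    opts = {
        'outtmpl': f'{out_dir}/%(title)s.%(ext)s',
        'noplaylist': True,
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([f'ytsearch1:{movie_title} trailer'])

def main(min_rating=2.0, year=2020):
    # Illustrative filter: "released this year, rating above 2".
    ia = IMDb()
    for movie in ia.get_top250_movies():  # just one convenient list; any IMDb query would do
        ia.update(movie)                  # fetch full data, including year and rating
        if movie.get('year') == year and (movie.get('rating') or 0) > min_rating:
            download_trailer(movie['title'])

if __name__ == "__main__":
    main()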

How to limit scrapy request objects?

社会主义新天地 Submitted on 2020-01-31 20:04:52

Question: So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes upwards of 100,000) when I check the telnet console:

>>> prefs()

I have been over the docs and Google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking requests once a certain amount goes into the scheduler. I have tried setting a DEPTH_LIMIT but that only …
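One straightforward option, sketched below as an assumption about what "hold back" should mean, is to count scheduled requests in the spider itself and stop yielding new ones past a cap; switching Scrapy to breadth-first order (DEPTH_PRIORITY plus FIFO queues) also keeps the scheduler from ballooning on link-rich pages.

import scrapy

class CappedSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "capped"
    start_urls = ['http://example.com/']
    max_requests = 10000  # illustrative cap on how many requests ever get scheduled

    custom_settings = {
        # Breadth-first order keeps fewer deep branches queued at once.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.scheduled = 0

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            if self.scheduled >= self.max_requests:
                return  # cap reached: stop feeding the scheduler
            self.scheduled += 1
            yield response.follow(href, callback=self.parse)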

Nutch - does not crawl, says “Stopping at depth=1 - no more URLs to fetch”

夙愿已清 Submitted on 2020-01-25 23:49:30

Question: I have been trying for a long time to crawl using Nutch, but it just doesn't seem to run. I'm trying to build a Solr search for a website, using Nutch for crawling and indexing into Solr. There were some permission problems originally, but they have been fixed now. The URL I'm trying to crawl is http://172.30.162.202:10200/ , which is not publicly accessible. It is an internal URL that can be reached from the Solr server. I tried browsing it using Lynx. Given below is the output from …
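The excerpt is cut off before the Lynx output, but one check worth scripting from the crawl host is whether the seed URL answers at all and what its robots.txt says, since Nutch honours robots rules and a blanket Disallow would produce exactly this "no more URLs to fetch" behaviour. A small diagnostic sketch, assuming it is run from the same host as the crawler; the seed URL is the one from the question.

from urllib.request import urlopen
from urllib.error import URLError

SEED = "http://172.30.162.202:10200/"

def probe(url):
    # Show the HTTP status and the first few hundred bytes, from this host's point of view.
    try:
        with urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)
            print(resp.read(300).decode("utf-8", errors="replace"))
    except URLError as exc:
        print(url, "->", exc)

if __name__ == "__main__":
    probe(SEED)                  # does the seed respond at all from here?
    probe(SEED + "robots.txt")   # a "Disallow: /" here would explain an empty fetch list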

Invalid URLs throw an exception - python

只愿长相守 Submitted on 2020-01-25 19:21:28

Question:

import httplib
import urlparse

def getUrl(url):
    try:
        parts = urlparse.urlsplit(url)
        server = parts[1]
        path = parts[2]
        obj = httplib.HTTPConnection(server, 80)
        obj.connect()
        obj.putrequest('HEAD', path)
        obj.putheader('Accept', '*/*')
        obj.endheaders()
        response = obj.getresponse()
        contentType = response.getheader("content-type", "unknown")
        obj.close()
        if response.status != 200:
            print 'Error'
        else:
            print 'Awesome'
    except Exception, e:
        print e

I wrote the code above to check if a given URL is valid or …
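The excerpt stops before the actual question, but a typical improvement on this pattern is to return a boolean instead of printing, catch only the network-related exceptions, and handle URLs that have no host at all. Below is a Python 3 sketch along those lines, using http.client and urllib.parse (the Python 3 names for httplib and urlparse); the 10-second timeout is an arbitrary choice.

import http.client
import urllib.parse

def url_is_valid(url, timeout=10):
    # HEAD the URL and report whether the server answered with 200 OK.
    parts = urllib.parse.urlsplit(url)
    if not parts.netloc:
        return False  # no host at all, e.g. a relative path or an empty string
    conn_cls = http.client.HTTPSConnection if parts.scheme == 'https' else http.client.HTTPConnection
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        conn.request('HEAD', parts.path or '/')
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except (http.client.HTTPException, OSError):
        # Covers unknown hosts, refused connections, timeouts, malformed responses.
        return False

if __name__ == "__main__":
    print(url_is_valid("https://example.com/"))  # True if the host is reachable
    print(url_is_valid("not a url"))             # False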