scraperwiki

lxml not working with django, scraperwiki

拈花ヽ惹草 submitted on 2019-12-24 19:01:32
Question: I'm working on a Django app that goes through Illinois' General Assembly website to scrape some PDFs. Deployed on my desktop it works fine until urllib2 times out, but when I deploy it on my Bluehost server, the lxml part of the code throws an error. Any help would be appreciated.

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def …
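
The truncated excerpt doesn't show where lxml fails, but a sketch of the fetch-and-parse step may help frame the problem: fetch with an explicit urllib2 timeout, then parse with lxml and collect PDF links. The URL and the ten-second timeout are illustrative assumptions, not values from the original question.

import urllib2
import lxml.html

# Hypothetical starting page; the original question scrapes ilga.gov.
url = 'http://www.ilga.gov/senate/'

try:
    # An explicit timeout keeps the command from hanging on a slow host.
    html = urllib2.urlopen(url, timeout=10).read()
except urllib2.URLError as e:
    raise SystemExit('fetch failed: %s' % e)

# lxml.html tolerates real-world markup better than strict lxml.etree.
doc = lxml.html.fromstring(html)
pdf_links = [href for href in doc.xpath('//a/@href')
             if href.lower().endswith('.pdf')]
print pdf_links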

Why isn't my KML feed working with Google Maps anymore?

烂漫一生 submitted on 2019-12-24 07:04:16
Question: I'm really confused. I have a KML feed at https://views.scraperwiki.com/run/hackney_council_planning_kml_output/? which worked perfectly fine with Google Maps until a few weeks ago: http://maps.google.com/maps?q=https://views.scraperwiki.com/run/hackney_council_planning_kml_output/? Now it gives me a "file not found" error. The feed validates fine: http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fviews.scraperwiki.com%2Frun%2Fhackney_council_planning_kml_output%2F Any idea what…
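
Google Maps fetches the KML itself, so one diagnostic step is to check what its fetcher would see: the HTTP status code and Content-Type of the feed. A rough sketch in the question's own urllib2 style; the expectation that Google wants a 200 response with an XML-ish content type is an assumption, not something stated in the question.

import urllib2

feed = 'https://views.scraperwiki.com/run/hackney_council_planning_kml_output/'
response = urllib2.urlopen(feed, timeout=10)

# Google's fetcher generally wants a 200 status and a KML/XML content
# type such as application/vnd.google-earth.kml+xml or text/xml.
print response.getcode()
print response.info().gettype()
print response.read()[:200]  # peek at the start of the document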

Parsing a numbered transcript into XML

别来无恙 submitted on 2019-12-23 01:58:42
Question: I want to build a scraper that parses transcripts from the Leveson Inquiry, which are plaintext in the following format:

1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed…
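
A minimal sketch of one way to turn those numbered lines into XML with lxml: strip the leading line number, open a new speech element whenever a line starts with an upper-case speaker name and a colon, and append continuation lines to the current speech. The regexes and element names are illustrative assumptions about the format, not part of the original question.

import re
import lxml.etree as etree

transcript = """\
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies.
"""

root = etree.Element('transcript')
current = None
for raw in transcript.splitlines():
    # Drop the leading transcript line number.
    line = re.sub(r'^\s*\d+\s', '', raw)
    m = re.match(r'^([A-Z][A-Z .]+):\s*(.*)$', line)
    if m:
        # A new speaker: open a fresh <speech> element.
        current = etree.SubElement(root, 'speech', speaker=m.group(1))
        current.text = m.group(2)
    elif current is not None:
        # Continuation line: fold into the current speech.
        current.text += ' ' + line.strip()

print etree.tostring(root, pretty_print=True)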

ScraperWiki: How to create and add records with autoincrement key

拟墨画扇 submitted on 2019-12-11 03:56:42
Question: Does anyone know how to create a table with a surrogate key? I'm looking for something like autoincrement: a large integer that automatically takes the next highest unique value as the primary key. I need to know how to create the table as well as how to add records (preferably through scraperwiki.sqlite.save). Thanks!

Answer 1: This seems to work for me for the specific case, if not the more general one: https://scraperwiki.com/scrapers/autoincr_demo Bonuses include not having to…
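
ScraperWiki's datastore is SQLite underneath, so one approach is to create the table yourself with an INTEGER PRIMARY KEY column, which SQLite fills in automatically when the column is omitted on insert. A sketch under the assumption that scraperwiki.sqlite.execute and scraperwiki.sqlite.save behave as in the classic ScraperWiki library; the table and column names are made up for illustration.

import scraperwiki

# In SQLite an INTEGER PRIMARY KEY column aliases the rowid: leave it
# out of an insert and SQLite assigns the next unique integer itself.
scraperwiki.sqlite.execute("""
    CREATE TABLE IF NOT EXISTS people (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        town TEXT
    )
""")
scraperwiki.sqlite.commit()

# Saving a record without the id column lets SQLite fill it in.
scraperwiki.sqlite.save(unique_keys=['name'],
                        data={'name': 'Alice', 'town': 'Hackney'},
                        table_name='people')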

Encoding error while parsing RSS with lxml

偶尔善良 submitted on 2019-12-10 02:38:12
Question: I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c…
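
The likely bug: etree.parse expects a filename, URL, or file-like object, so passing the raw bytes makes lxml treat the feed content as a path. A sketch of the usual fix, parsing the bytes directly with etree.fromstring; the feed URL comes from the question, the rest is standard lxml usage.

import urllib2
import lxml.etree as etree
from StringIO import StringIO

data = urllib2.urlopen('http://wiadomosci.onet.pl/kraj/rss.xml').read()

# fromstring accepts raw bytes; the parser honours the encoding in the
# XML declaration, and recover=True survives minor breakage.
parser = etree.XMLParser(ns_clean=True, recover=True)
root = etree.fromstring(data, parser)

# Equivalent alternative: wrap the bytes in a file-like object for parse().
tree = etree.parse(StringIO(data), parser)

print root.tag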

Screenscraping aspx with Python Mechanize - JavaScript form submission

£可爱£侵袭症+ submitted on 2019-11-29 14:42:11
Question: I'm trying to scrape the UK Food Standards Agency's food ratings aspx search results pages (e.g. http://ratings.food.gov.uk/QuickSearch.aspx?q=po30 ) using Mechanize/Python on ScraperWiki ( http://scraperwiki.com/scrapers/food_standards_agency/ ), but I'm running into a problem when trying to follow "next" page links, which have the form:

<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />

The form handler looks…
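
Because the "next" link is a named submit control inside the ASP.NET form, mechanize can usually click it directly instead of simulating the JavaScript postback. A rough sketch, assuming the results sit in the page's first form and the button name stays stable across pages; if the site depends on JavaScript setting __EVENTTARGET, the hidden fields would need to be filled in by hand instead.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://ratings.food.gov.uk/QuickSearch.aspx?q=po30')

while True:
    # ... scrape the current results from br.response().read() here ...
    try:
        # ASP.NET pages typically wrap everything in one server-side form.
        br.select_form(nr=0)
        # Clicking the named submit control posts back for the next page.
        br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext')
    except mechanize.ControlNotFoundError:
        break  # no "next" button on the last page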

How to install Poppler on Windows

可紊 submitted on 2019-11-27 01:36:20
Question: The most recent version of scraperwiki depends on Poppler (or so the GitHub page says). Unfortunately it only explains how to get it on OS X and Linux, not Windows. A quick Google search turned up nothing too promising, so does anyone know how to get Poppler on Windows for scraperwiki?

Answer 1: Poppler Windows binaries are available from ftp://ftp.gnome.org/Public/GNOME/binaries/win32/dependencies/ but note that those aren't quite up to date. If you're looking for Python (2.7) bindings (as this question's…
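
Once the binaries are unpacked, the practical step is making Poppler's bin directory visible to the Python process so scraperwiki's PDF helpers can shell out to its tools. A small sketch under the assumption that a Poppler command-line tool such as pdftohtml is what scraperwiki ultimately invokes; the install path is hypothetical.

import os
import subprocess

# Hypothetical unpack location for the Poppler Windows binaries.
poppler_bin = r'C:\poppler\bin'
os.environ['PATH'] = poppler_bin + os.pathsep + os.environ['PATH']

# Sanity check: can the Poppler tools be launched at all?
try:
    subprocess.call(['pdftohtml', '-v'])
except OSError:
    raise SystemExit('Poppler tools not found on PATH')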