scraperwiki

lxml not working with django, scraperwiki

拈花ヽ惹草 submitted on 2019-12-24 19:01:32
Question: I'm working on a Django app that goes through Illinois' General Assembly website to scrape some PDFs. Deployed on my desktop it works fine until urllib2 times out, but when I deploy it on my Bluehost server, the lxml part of the code throws an error. Any help would be appreciated.

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def …
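
The truncated excerpt doesn't show where lxml fails, but a sketch of the fetch-and-parse step may help frame the problem: fetch with an explicit urllib2 timeout, then parse with lxml and collect PDF links. The URL and the ten-second timeout are illustrative assumptions, not values from the original question.

import urllib2
import lxml.html

# Hypothetical starting page; the original question scrapes ilga.gov.
url = 'http://www.ilga.gov/senate/'

try:
    # An explicit timeout keeps the command from hanging on a slow host.
    html = urllib2.urlopen(url, timeout=10).read()
except urllib2.URLError as e:
    raise SystemExit('fetch failed: %s' % e)

# lxml.html tolerates real-world markup better than strict lxml.etree.
doc = lxml.html.fromstring(html)
pdf_links = [href for href in doc.xpath('//a/@href')
             if href.lower().endswith('.pdf')]
print pdf_links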

Why isn't my KML feed working with Google Maps anymore?

烂漫一生 submitted on 2019-12-24 07:04:16
Question: I'm really confused. I have a KML feed at https://views.scraperwiki.com/run/hackney_council_planning_kml_output/? which worked perfectly fine with Google Maps until a few weeks ago: http://maps.google.com/maps?q=https://views.scraperwiki.com/run/hackney_council_planning_kml_output/? Now it gives me a "file not found" error. The feed validates fine: http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fviews.scraperwiki.com%2Frun%2Fhackney_council_planning_kml_output%2F Any idea what…
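
Google Maps fetches the KML itself, so one diagnostic step is to check what its fetcher would see: the HTTP status code and Content-Type of the feed. A rough sketch in the question's own urllib2 style; the expectation that Google wants a 200 response with an XML-ish content type is an assumption, not something stated in the question.

import urllib2

feed = 'https://views.scraperwiki.com/run/hackney_council_planning_kml_output/'
response = urllib2.urlopen(feed, timeout=10)

# Google's fetcher generally wants a 200 status and a KML/XML content
# type such as application/vnd.google-earth.kml+xml or text/xml.
print response.getcode()
print response.info().gettype()
print response.read()[:200]  # peek at the start of the document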

Parsing a numbered transcript into XML

别来无恙 submitted on 2019-12-23 01:58:42
Question: I want to build a scraper that parses transcripts from the Leveson Inquiry, which are plaintext in the following format:

1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed…
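
A minimal sketch of one way to turn those numbered lines into XML with lxml: strip the leading line number, open a new speech element whenever a line starts with an upper-case speaker name and a colon, and append continuation lines to the current speech. The regexes and element names are illustrative assumptions about the format, not part of the original question.

import re
import lxml.etree as etree

transcript = """\
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies.
"""

root = etree.Element('transcript')
current = None
for raw in transcript.splitlines():
    # Drop the leading transcript line number.
    line = re.sub(r'^\s*\d+\s', '', raw)
    m = re.match(r'^([A-Z][A-Z .]+):\s*(.*)$', line)
    if m:
        # A new speaker: open a fresh <speech> element.
        current = etree.SubElement(root, 'speech', speaker=m.group(1))
        current.text = m.group(2)
    elif current is not None:
        # Continuation line: fold into the current speech.
        current.text += ' ' + line.strip()

print etree.tostring(root, pretty_print=True)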

ScraperWiki: How to create and add records with autoincrement key

拟墨画扇 submitted on 2019-12-11 03:56:42
Question: Does anyone know how to create a table with a surrogate key? I'm looking for something like autoincrement: a large integer that automatically takes the next highest unique value as the primary key. I need to know how to create the table as well as how to add records (preferably through scraperwiki.sqlite.save). Thanks!

Answer 1: This seems to work for me for the specific case, if not the more general one: https://scraperwiki.com/scrapers/autoincr_demo Bonuses include not having to…
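
ScraperWiki's datastore is SQLite underneath, so one approach is to create the table yourself with an INTEGER PRIMARY KEY column, which SQLite fills in automatically when the column is omitted on insert. A sketch under the assumption that scraperwiki.sqlite.execute and scraperwiki.sqlite.save behave as in the classic ScraperWiki library; the table and column names are made up for illustration.

import scraperwiki

# In SQLite an INTEGER PRIMARY KEY column aliases the rowid: leave it
# out of an insert and SQLite assigns the next unique integer itself.
scraperwiki.sqlite.execute("""
    CREATE TABLE IF NOT EXISTS people (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        town TEXT
    )
""")
scraperwiki.sqlite.commit()

# Saving a record without the id column lets SQLite fill it in.
scraperwiki.sqlite.save(unique_keys=['name'],
                        data={'name': 'Alice', 'town': 'Hackney'},
                        table_name='people')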

Encoding error while parsing RSS with lxml

偶尔善良 submitted on 2019-12-10 02:38:12
Question: I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c…
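
The likely bug: etree.parse expects a filename, URL, or file-like object, so passing the raw bytes makes lxml treat the feed content as a path. A sketch of the usual fix, parsing the bytes directly with etree.fromstring; the feed URL comes from the question, the rest is standard lxml usage.

import urllib2
import lxml.etree as etree
from StringIO import StringIO

data = urllib2.urlopen('http://wiadomosci.onet.pl/kraj/rss.xml').read()

# fromstring accepts raw bytes; the parser honours the encoding in the
# XML declaration, and recover=True survives minor breakage.
parser = etree.XMLParser(ns_clean=True, recover=True)
root = etree.fromstring(data, parser)

# Equivalent alternative: wrap the bytes in a file-like object for parse().
tree = etree.parse(StringIO(data), parser)

print root.tag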

Screenscraping aspx with Python Mechanize - JavaScript form submission

£可爱£侵袭症+ submitted on 2019-11-29 14:42:11
Question: I'm trying to scrape the UK Food Standards Agency's food ratings aspx search results pages (e.g. http://ratings.food.gov.uk/QuickSearch.aspx?q=po30 ) using Mechanize/Python on ScraperWiki ( http://scraperwiki.com/scrapers/food_standards_agency/ ), but I'm running into a problem when trying to follow "next" page links, which have the form:

<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />

The form handler looks…
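
Because the "next" link is a named submit control inside the ASP.NET form, mechanize can usually click it directly instead of simulating the JavaScript postback. A rough sketch, assuming the results sit in the page's first form and the button name stays stable across pages; if the site depends on JavaScript setting __EVENTTARGET, the hidden fields would need to be filled in by hand instead.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://ratings.food.gov.uk/QuickSearch.aspx?q=po30')

while True:
    # ... scrape the current results from br.response().read() here ...
    try:
        # ASP.NET pages typically wrap everything in one server-side form.
        br.select_form(nr=0)
        # Clicking the named submit control posts back for the next page.
        br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext')
    except mechanize.ControlNotFoundError:
        break  # no "next" button on the last page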

How to install Poppler on Windows

可紊 submitted on 2019-11-27 01:36:20
Question: The most recent version of scraperwiki depends on Poppler (or so the GitHub page says). Unfortunately it only explains how to get it on OS X and Linux, not Windows. A quick Google search turned up nothing too promising, so does anyone know how to get Poppler on Windows for scraperwiki?

Answer 1: Poppler Windows binaries are available from ftp://ftp.gnome.org/Public/GNOME/binaries/win32/dependencies/ but note that those aren't quite up to date. If you're looking for Python (2.7) bindings (as this question's…
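
Once the binaries are unpacked, the practical step is making Poppler's bin directory visible to the Python process so scraperwiki's PDF helpers can shell out to its tools. A small sketch under the assumption that a Poppler command-line tool such as pdftohtml is what scraperwiki ultimately invokes; the install path is hypothetical.

import os
import subprocess

# Hypothetical unpack location for the Poppler Windows binaries.
poppler_bin = r'C:\poppler\bin'
os.environ['PATH'] = poppler_bin + os.pathsep + os.environ['PATH']

# Sanity check: can the Poppler tools be launched at all?
try:
    subprocess.call(['pdftohtml', '-v'])
except OSError:
    raise SystemExit('Poppler tools not found on PATH')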