LXML unable to retrieve webpage with error “failed to load HTTP resource”

Submitted by 北战南征 on 2019-12-12 04:12:46

Question


Hi, I tried opening the link below in a browser and it works, but not from the code. The link is a combination of the news site's base URL and the article path read from another file, url.txt. I tried the code with a normal website (www.google.com) and it works perfectly.

import sys
import MySQLdb
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer
from nltk import word_tokenize
from nltk.tokenize import *
import urllib2
import nltk, re, pprint
import mechanize #html form filling
import lxml.html

with open("url.txt","r") as f:
    first_line = f.readline()
#print first_line
url = "http://channelnewsasia.com/&s" + (first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

And this is the error I am getting: "failed to load HTTP resource".

And this is the content of url.txt

/news/asiapacific/australia-to-send-armed/1284790.html


Answer 1:


This is because of the &s part of the URL; it is not needed:

url = "http://channelnewsasia.com" + first_line

Also, URL parts are better joined using urljoin():

from urlparse import urljoin
import lxml.html

BASE_URL = "http://channelnewsasia.com"

with open("url.txt") as f:
    first_line = f.readline()

# join the base URL and the article path instead of concatenating strings
url = urljoin(BASE_URL, first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

prints:

Australia to send armed personnel to MH17 site - Channel NewsAsia
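
If you happen to be on Python 3 (the question's code is Python 2, so this is only an assumption), the same fix works with urljoin from urllib.parse and print as a function; a minimal sketch:

from urllib.parse import urljoin  # urljoin lives in urllib.parse on Python 3

import lxml.html

BASE_URL = "http://channelnewsasia.com"

with open("url.txt") as f:
    first_line = f.readline().strip()  # readline() keeps the trailing newline, so strip it

url = urljoin(BASE_URL, first_line)  # urljoin handles the slash between base and path
t = lxml.html.parse(url)
print(t.find(".//title").text)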


Source: https://stackoverflow.com/questions/25007501/lxml-unable-to-retrieve-webpage-with-error-failed-to-load-http-resource
