Question
I recently started working on a Python program that lets the user conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" has the page:
"http://www.spanishdict.com/conjugate/beber"
To open the page, I use the following python code:
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:
soup = BeautifulSoup(source)
I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:
<tr>
<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>
<td class="">
bebí </td>
<td class="">
bebía </td>
<td class="">
bebería </td>
<td class="">
beberé </td>
</tr>
What am I doing wrong? I am no professional at Python or web parsing in general, so it may be a simple problem.
Here is my complete code (I used the "++++++" to differentiate the two):
import urllib
from bs4 import BeautifulSoup
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)
print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)
Answer 1:
Your problem may be with encoding. I think bs4 works with UTF-8, while your machine has a different default encoding (one that contains Spanish letters). So urllib requests the page in your default encoding; the data is there in the source and even prints out fine, but when you pass it to the UTF-8-based bs4, those characters are lost. Try setting a different encoding in bs4, and if possible set it to your default. This is just a guess though, take it easy.
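The encoding guess above can be illustrated without touching the network: `urlopen().read()` returns raw bytes, and decoding them explicitly before parsing removes any guesswork. A minimal sketch, written in Python 3 syntax even though the question's code is Python 2:

```python
# Simulate what urlopen().read() returns: raw UTF-8 bytes, not text.
raw = "bebí".encode("utf-8")

# Decoding explicitly before handing the text to a parser avoids
# the parser (bs4 or anything else) guessing the encoding wrong.
text = raw.decode("utf-8")
print(text)  # bebí
```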
I recommend using regular expressions. I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your regular expressions manually and let them do the magic; you would have to work with bs4 in a similar way when looking for the information you want.
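A minimal sketch of the regular-expression approach suggested above, run against a fragment shaped like the rows in the question (the pattern and variable names are illustrative, not from the original post):

```python
import re

# A fragment shaped like the <td> rows in the question.
html = '''<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>'''

# re.S lets "." match newlines, so a cell's text is captured even
# when it spans lines; ".*?" keeps the match non-greedy per cell.
cells = [m.strip() for m in re.findall(r'<td[^>]*>(.*?)</td>', html, re.S)]
print(cells)  # ['yo', 'bebo']
```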
Answer 2:
When I wrote parsers I had problems with bs: in some cases it didn't find elements that lxml found, and vice versa, because of broken HTML. Try using lxml.html.
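A minimal sketch of the lxml.html suggestion, assuming lxml is installed (the snippet and names below are illustrative, not from the original post):

```python
from lxml import html

# A fragment shaped like the conjugation table in the question.
snippet = ('<table><tr>'
           '<td class="verb-pronoun-row">yo</td>'
           '<td class="">bebo</td>'
           '</tr></table>')

# lxml.html tolerates malformed markup better than strict parsers,
# which is the point of the answer above.
tree = html.fromstring(snippet)
cells = [td.text_content().strip() for td in tree.xpath('//td')]
print(cells)  # ['yo', 'bebo']
```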
Source: https://stackoverflow.com/questions/15044563/parsing-web-pages-search-results-with-python