Question
I recently started working on a Python program that lets the user conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" has the page:
"http://www.spanishdict.com/conjugate/beber"
To open the page, I use the following python code:
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:
soup = BeautifulSoup(source)
I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:
<tr>
<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>
<td class="">
bebí </td>
<td class="">
bebía </td>
<td class="">
bebería </td>
<td class="">
beberé </td>
</tr>
What am I doing wrong? I am no professional at Python or web parsing in general, so it may be a simple problem.
Here is my complete code (I used the "++++++" to differentiate the two):
import urllib
from bs4 import BeautifulSoup
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)
print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)
Answer 1:
Your problem may be with encoding. I think bs4 works with UTF-8, while your machine has a different default encoding (one that contains Spanish letters). So urllib requests the page in your default encoding; the data is there in the source and even prints out fine, but when you pass it to the UTF-8-based bs4, those characters are lost. Try setting a different encoding in bs4, and if possible set it to your default. This is just a guess though, take it easy.
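The encoding guess above can be illustrated without touching the network: `urlopen().read()` returns raw bytes, and decoding them explicitly before parsing removes any guesswork. A minimal sketch, written in Python 3 syntax even though the question's code is Python 2:

```python
# Simulate what urlopen().read() returns: raw UTF-8 bytes, not text.
raw = "bebí".encode("utf-8")

# Decoding explicitly before handing the text to a parser avoids
# the parser (bs4 or anything else) guessing the encoding wrong.
text = raw.decode("utf-8")
print(text)  # bebí
```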
I recommend using regular expressions. I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your regular expressions manually and let them do the magic; you would have to work with bs4 in a similar way when looking for the information you want.
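A minimal sketch of the regular-expression approach suggested above, run against a fragment shaped like the rows in the question (the pattern and variable names are illustrative, not from the original post):

```python
import re

# A fragment shaped like the <td> rows in the question.
html = '''<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>'''

# re.S lets "." match newlines, so a cell's text is captured even
# when it spans lines; ".*?" keeps the match non-greedy per cell.
cells = [m.strip() for m in re.findall(r'<td[^>]*>(.*?)</td>', html, re.S)]
print(cells)  # ['yo', 'bebo']
```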
Answer 2:
When I wrote parsers I had problems with bs: in some cases it didn't find elements that lxml found, and vice versa, because of broken HTML. Try using lxml.html.
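A minimal sketch of the lxml.html suggestion, assuming lxml is installed (the snippet and names below are illustrative, not from the original post):

```python
from lxml import html

# A fragment shaped like the conjugation table in the question.
snippet = ('<table><tr>'
           '<td class="verb-pronoun-row">yo</td>'
           '<td class="">bebo</td>'
           '</tr></table>')

# lxml.html tolerates malformed markup better than strict parsers,
# which is the point of the answer above.
tree = html.fromstring(snippet)
cells = [td.text_content().strip() for td in tree.xpath('//td')]
print(cells)  # ['yo', 'bebo']
```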
Source: https://stackoverflow.com/questions/15044563/parsing-web-pages-search-results-with-python