Why is BeautifulSoup not finding a specific table class? [closed]


生来不讨喜


2 answers

小鲜肉

2020-12-10 23:37
The page uses broken HTML, and different parsers try to repair it in different ways. Install the lxml parser; it parses that page better:
```
>>> BeautifulSoup(html, 'html.parser').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
True
>>> BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
False
```
This doesn't mean that lxml will handle all broken HTML better than the other parser options. Also look at html5lib, a pure-Python implementation of the WHATWG HTML spec, which therefore follows more closely how current browsers handle broken HTML.
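To see how the parsers differ on a given page, you can run the same lookup through each one that is installed. This is only a sketch: the helper name `cell_found_by_parser` and the sample markup are made up for illustration, and the result for any real broken page depends on that page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def cell_found_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: True/False/None} for the cntBoxGreyLnk cell.

    None means the parser is not installed (hypothetical helper).
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is missing
            results[name] = None
            continue
        results[name] = soup.find("td", {"class": "cntBoxGreyLnk"}) is not None
    return results

# Illustrative markup with unclosed tags; substitute the real page source.
sample = '<table class="cntTb"><tr><td class="cntBoxGreyLnk">prices</td>'
results = cell_found_by_parser(sample)
for name, found in results.items():
    print(name, found)
```

Whichever parser finds the element on your actual page is the one to pass as the second argument to `BeautifulSoup`.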
予麋鹿

2020-12-10 23:57
Looking at the page source:
```
<td class="cntBoxGreyLnk" rowspan="2" valign="top">
    <script type="text/javascript" src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"></script>
    <noscript> To get live <a href="http://www.oil-price.net/dashboard.php?lang=en#COMMODITIES">gold, oil and commodity price</a>, please enable Javascript.</noscript>
```
the data you want is dynamically loaded into the page; you can't get it with BeautifulSoup because it doesn't exist in the HTML.
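You can confirm this from Python as well. The sketch below uses only the standard library to list the `src` of every script inside that cell; the class name `CellScriptFinder` is made up, and the sample is the snippet above with the noscript text shortened and a closing `</td>` added so it stands alone.

```python
from html.parser import HTMLParser

class CellScriptFinder(HTMLParser):
    """Collect the src of every <script> inside <td class="cntBoxGreyLnk">."""

    def __init__(self, cell_class="cntBoxGreyLnk"):
        super().__init__()
        self.cell_class = cell_class
        self.depth = 0          # > 0 while we are inside the target cell
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            if self.depth:
                self.depth += 1  # nested cell inside the target cell
            elif self.cell_class in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "script" and self.depth and attrs.get("src"):
            self.script_urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

# Shortened sample; the closing </td> is an assumption.
SAMPLE = '''<td class="cntBoxGreyLnk" rowspan="2" valign="top">
    <script type="text/javascript" src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"></script>
    <noscript> To get live prices, please enable Javascript.</noscript>
</td>'''

finder = CellScriptFinder()
finder.feed(SAMPLE)
print(finder.script_urls)
# ['http://www.oil-price.net/COMMODITIES/gen.php?lang=en']
```

The cell holds nothing but a script reference and a noscript fallback, so the prices have to come from that script.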

If you look at the linked script URL, http://www.oil-price.net/COMMODITIES/gen.php?lang=en, you see a bunch of JavaScript like:
```
document.writeln('<table summary=\"Crude oil and commodity prices (c) http://oil-price.net\" style=\"font-family: Lucida Sans Unicode, Lucida Grande, Sans-Serif; font-size: 12px; background: #fff; border-collapse: collapse; text-align: left; border-color: #6678b1; border-width: 1px 1px 1px 1px; border-style: solid;\">');
document.writeln('<thead>');
/* ... */
document.writeln('<tr>');
document.writeln('<td style=\"font-size: 12px; font-weight: bold; border-bottom: 1px solid #ccc; color: #1869bd; padding: 2px 6px; white-space: nowrap;\">');
document.writeln('<a href=\"http://oil-price.net/dashboard.php?lang=en#COMMODITIES\"  style=\"color: #1869bd; text-decoration:none\">Heating Oil<\/a>');
document.writeln('<\/td>');
document.writeln('<td style=\"font-size: 12px; font-weight: normal; border-bottom: 1px solid #ccc; color: #000000; padding: 2px 6px; white-space: nowrap;\">');
document.writeln('3.05');
document.writeln('<\/td>');
document.writeln('<td style=\"font-size: 12px; font-weight: normal; border-bottom: 1px solid #ccc; color: green;    padding: 2px 6px; white-space: nowrap;\">');
document.writeln('+1.81%');
document.writeln('<\/td><\/tr>');
```
When the page is loaded, this JavaScript runs and dynamically writes in the values you are looking for. (As an aside: this is a completely archaic, deprecated, and generally horrible way of doing things; I can only presume someone thinks of it as an extra layer of security. They deserve to be punished for their temerity!)

Now, this code is pretty straightforward; you could probably grab the HTML data with a regular expression. But (a) there are some escape codes which could cause problems, (b) there's no guarantee they won't obfuscate their code in future, and (c) where's the fun in that?
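For completeness, here is roughly what the regular-expression route could look like, using only the standard library. It is a sketch with all the caveats just listed: the names `WRITELN`, `extract_html` and `RowCollector` are made up, the sample is a shortened copy of the script lines shown earlier, and the unescaping only handles simple backslash escapes such as `\"` and `\/`.

```python
import re
from html.parser import HTMLParser

# Matches one document.writeln('...') call; the body may contain
# backslash-escaped characters such as \" and \/.
WRITELN = re.compile(r"document\.writeln\('((?:[^'\\]|\\.)*)'\);")

def extract_html(js):
    """Join the string arguments of all document.writeln calls."""
    lines = WRITELN.findall(js)
    # Drop the backslash from each escape (\" -> ", \/ -> /).
    return "\n".join(re.sub(r"\\(.)", r"\1", line) for line in lines)

class RowCollector(HTMLParser):
    """Gather the stripped text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

# Shortened sample of the script (styles trimmed for readability).
SAMPLE_JS = r'''
document.writeln('<tr>');
document.writeln('<td style=\"font-weight: bold;\">');
document.writeln('<a href=\"http://oil-price.net/dashboard.php?lang=en#COMMODITIES\">Heating Oil<\/a>');
document.writeln('<\/td>');
document.writeln('<td style=\"font-weight: normal;\">');
document.writeln('3.05');
document.writeln('<\/td>');
document.writeln('<td style=\"color: green;\">');
document.writeln('+1.81%');
document.writeln('<\/td><\/tr>');
'''

collector = RowCollector()
collector.feed(extract_html(SAMPLE_JS))
for row in collector.rows:
    print(" / ".join(row))
# Heating Oil / 3.05 / +1.81%
```

This works only while the script stays a flat list of `document.writeln('...')` calls with simple escapes.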

The PyV8 module provides a straightforward method of executing JavaScript code from Python, and even allows us to write JavaScript-callable Python code! We will take advantage of that to get the data in a non-obfuscatable way:
```
import PyV8
import requests
from bs4 import BeautifulSoup

SCRIPT = "http://www.oil-price.net/COMMODITIES/gen.php?lang=en"

class Document:
    def __init__(self):
        self.lines = []

    def writeln(self, s):
        self.lines.append(s)

    @property
    def content(self):
        return '\n'.join(self.lines)

class DOM(PyV8.JSClass):
    def __init__(self):
        self.document = Document()

def main():
    # Create a javascript context which contains
    #   a document object having a writeln method.
    # This allows us to capture the calls to document.writeln()
    dom  = DOM()
    ctxt = PyV8.JSContext(dom)
    ctxt.enter()

    # Grab the javascript and execute it
    js = requests.get(SCRIPT).text  # .text gives a str rather than bytes
    ctxt.eval(js)

    # The result is the HTML code you are looking for
    html = dom.document.content

    # html is now "<table> ... </table>" containing the data you are after;
    # you can go ahead and finish parsing it with BeautifulSoup
    # (naming the parser explicitly avoids bs4's "no parser specified" warning)
    tbl = BeautifulSoup(html, 'html.parser')
    for row in tbl.find_all('tr'):
        print(' / '.join(td.text.strip() for td in row.find_all('td')))

if __name__ == "__main__":
    main()
```
This results in:
```
Crude Oil / 99.88 / +2.04%
Natural Gas / 4.78 / -3.27%
Gasoline / 2.75 / +2.40%
Heating Oil / 3.05 / +1.81%
Gold / 1263.30 / +0.45%
Silver / 19.92 / +0.06%
Copper / 3.27 / +0.37%
```
which is the data you wanted.

Edit: I can't really dumb it down any more; it's the dead minimum code that does the job. But maybe I can better explain how it works (it's really not as scary as it looks!):

The PyV8 module wraps Google's V8 javascript interpreter in such a way that Python can interact with it. You will need to go to https://code.google.com/p/pyv8/downloads/list to download and install the appropriate version before you can run my code.

The javascript language, by itself, doesn't know anything about how to interact with the outside world; it has no built-in input or output methods. This is not terribly useful. To solve this, we can pass in a 'context object' which contains information about the outside world and how to interact with it. When javascript is run in a web browser, it gets a context object which provides all sorts of information about the browser and the current web page and how to interact with them.

The javascript code from http://www.oil-price.net/COMMODITIES/gen.php?lang=en assumes that it will be run in a browser, where the context has a "document" object representing the web page, which has a "writeln" method which appends text to the current end of the web page. As the page is loaded, the script gets loaded and runs; it writes text (which just happens to be valid HTML) into the page; this gets rendered as part of the page, ending up as the Commodities table you wanted. You can't get the table with BeautifulSoup because the table doesn't exist until the javascript runs, and BeautifulSoup doesn't load or run javascript.

We want to run the JavaScript; to do so, we need a fake browser context which has a "document" object with a "writeln" method. Then we need to store the information that gets passed to "writeln", and we need a way to get it back when the script is finished. My DOM class is the fake browser context; when instantiated (i.e. when we make one of them), it gives itself a Document object called document, which has a writeln method. When document.writeln is called, it appends the line of text to document.lines, and at any time we can read document.content to get back all text written so far.
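You can watch the capture mechanism work without PyV8 at all. In this sketch plain Python makes the `writeln` calls that the script would make; `FakeBrowserContext` is a made-up stand-in for the DOM class with the `PyV8.JSClass` base left out.

```python
class Document:
    """Stand-in for the browser's document object: records writeln calls."""

    def __init__(self):
        self.lines = []

    def writeln(self, s):
        self.lines.append(s)

    @property
    def content(self):
        return "\n".join(self.lines)

class FakeBrowserContext:
    """Hypothetical PyV8-free version of the DOM class."""

    def __init__(self):
        self.document = Document()

ctx = FakeBrowserContext()
document = ctx.document

# These calls mimic what the downloaded script does when it runs.
document.writeln("<tr>")
document.writeln("<td>3.05</td>")
document.writeln("</tr>")

print(document.content)
# <tr>
# <td>3.05</td>
# </tr>
```

With PyV8 the only difference is who makes the calls: the interpreter looks up `document` on the context object and calls its `writeln` from JavaScript.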

Now: the action! In the main function, we create a fake browser context, set it as the interpreter's current context, and start the interpreter. We grab the JavaScript code, and tell the interpreter to evaluate (i.e. run) it. (Source-code obfuscation, which can screw up static analysis, will not affect us because the code has to produce good output when it runs, and we are actually running it!) Once the code is finished, we get the final output from document.content; this is the table HTML which you were unable to get. We pass that back into BeautifulSoup to pull out the data, then print the data.

Hope that helps!