Question
I am trying to extract the price data (high and low) from the third table (corn). The code returns "None":
import urllib2
from bs4 import BeautifulSoup
import time
import re

start_urls = 4539
nb_quotes = 10
for urls in range(start_urls, start_urls - nb_quotes, -1):
    start_time = time.time()
    # construct the URLs strings
    url = 'http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains'
    # Read the HTML page content
    page = urllib2.urlopen(url)
    # Create a beautifulsoup object
    soup = BeautifulSoup(page)
    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[2]  # This is the table to be parsed
    low_tmp = str(tab.findAll('tr')[0].findAll('td')[1].getText())  # Low price
    low = re.sub('[+]', '', low_tmp)
    high_tmp = str(tab.findAll('tr')[0].findAll('td')[2].string)  # High price
    high = re.sub('[+]', '', high_tmp)
    stop_time = time.time()
    print low, '\t', high, '(%0.1f s)' % (stop_time - start_time)
Answer 1:
The data in the table is filled in on the browser side by the following JavaScript call:
document.write(getQuoteboardHTML(
splitQuote(quotes, 'ZC*1,ZC*2,ZC*3,ZC*4,ZC*5,ZC*6,ZC*7,ZC*8,ZC*9'.split(/,/)),
'shortmonthonly,high,low,last,change'.split(/,/), { nospacers: true }));
BeautifulSoup is an HTML parser; it does not execute JavaScript.
Basically, you need something to execute that JavaScript for you.
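To make the point concrete, here is a minimal Python 3 sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup (an assumption made so the example has no third-party dependencies). The RAW_HTML string is a hypothetical, simplified imitation of the page, not its real markup. The parser sees the script body only as text, so it never finds any table cells.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the page: the table row is produced
# by a script at load time, so it is absent from the raw HTML markup.
RAW_HTML = """
<table class="homepage_quoteboard">
<script>document.write('<tr><td>SEP</td><td>338-0</td><td>329-0</td></tr>');</script>
</table>
"""


class CellCounter(HTMLParser):
    """Counts <td> tags seen as real markup and collects script source text."""

    def __init__(self):
        super().__init__()
        self.td_count = 0
        self.in_script = False
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_count += 1
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives here as plain text, possibly in several chunks.
        if self.in_script:
            self.script_text.append(data)


parser = CellCounter()
parser.feed(RAW_HTML)
td_count = parser.td_count
script_source = "".join(parser.script_text)

print("td tags found:", td_count)
print("script text:", script_source)
```

A static parser reports no td tags here because the cells exist only inside the JavaScript string. This is consistent with the question's code finding an empty cell and printing "None".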
One solution is to drive a real browser with the help of Selenium:
from selenium import webdriver
url = "http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains"
driver = webdriver.Firefox()
driver.get(url)
table = driver.find_element_by_xpath('//td[contains(div[@class="fixedpage_heading"], "CORN")]/table[@class="homepage_quoteboard"]')
for row in table.find_elements_by_tag_name('tr')[1:]:
    month = row.find_element_by_class_name('quotefield_shortmonthonly').text
    low = row.find_element_by_class_name('quotefield_low').text
    high = row.find_element_by_class_name('quotefield_high').text
    print month, low, high
driver.close()
Prints:
SEP 329-0 338-0
DEC 335-6 345-4
MAR 348-2 358-0
MAY 356-6 366-0
JUL 364-0 373-4
SEP 372-0 379-4
DEC 382-0 390-2
MAR 392-4 399-0
MAY 400-0 405-0
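If you need numbers rather than strings, the printed quotes can be converted. The Python 3 sketch below assumes the digit after the hyphen counts eighths of a cent, the usual convention for CBOT grain quotes. This matches the raw data for the first contract, where 338-0 and 329-0 appear as '338' and '329', but treat it as an assumption to verify. The helper name quote_to_cents is hypothetical.

```python
def quote_to_cents(quote):
    """Convert a 'whole-eighths' quote such as '335-6' to decimal cents.

    Assumes the part after the hyphen is a count of eighths of a cent.
    """
    whole, eighths = quote.split("-")
    return int(whole) + int(eighths) / 8.0


# First two rows of the output above: (month, low, high)
rows = [("SEP", "329-0", "338-0"), ("DEC", "335-6", "345-4")]
for month, low, high in rows:
    print(month, quote_to_cents(low), quote_to_cents(high))
```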
Another option would be to "go down to metal" and see what the splitQuote() and getQuoteboardHTML() JavaScript functions actually do. Using the browser developer tools, you can see that there is an underlying request that returns a piece of JavaScript code containing all the objects with the data for the tables on the page:
var quotes = { 'ZC*1': { name: 'Corn', flag: 's', price_2_close: '338.75', open_interest: '2701', tradetime: '20140911133000', symbol: 'ZCU14', open: '338', high: '338', low: '329', last: '331.75', change: '-7', pctchange: '-2.07', volume: '1623', exchange: 'CBOT', type: '2', unitcode: '-1', date: '14104 ... ', month: 'May 2015', shortmonth: 'May 2015' } };
If you manage to extract the necessary parts from it, this would be your second option.
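A rough Python 3 sketch of that extraction is below. It works on a shortened sample string, because the request URL is not given here; in practice you would download the response text first. It assumes every entry is a flat object with single-quoted values and no nested braces, as in the excerpt above. The ZC*1 values come from that excerpt; the ZC*2 values are illustrative placeholders. The name parse_quotes is hypothetical.

```python
import re

# Shortened sample shaped like the response above.
# ZC*1 values are from the excerpt; ZC*2 values are illustrative placeholders.
SAMPLE_JS = (
    "var quotes = { "
    "'ZC*1': { name: 'Corn', high: '338', low: '329', last: '331.75' }, "
    "'ZC*2': { name: 'Corn', high: '345.5', low: '335.75' } "
    "};"
)

# One entry looks like  'KEY': { field: 'value', ... }  with no nested braces.
ENTRY_RE = re.compile(r"'([^']+)'\s*:\s*\{([^{}]*)\}")
# One field looks like  name: 'value'
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_quotes(js_text):
    """Return {key: {field: value}} for every flat entry found in js_text."""
    return {
        key: dict(FIELD_RE.findall(body))
        for key, body in ENTRY_RE.findall(js_text)
    }


quotes = parse_quotes(SAMPLE_JS)
for key in sorted(quotes):
    if key.startswith("ZC*"):  # corn contracts, per the JavaScript call above
        print(key, quotes[key]["low"], quotes[key]["high"])
```

Regular expressions are fragile for this kind of input. If the real response contains nested objects or escaped quotes, a proper JavaScript or JSON5-style parser would be a safer choice.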
Source: https://stackoverflow.com/questions/25794935/beautifulsoup-scraping-td-tr