Question
I'm trying to scrape a table on an AJAX page with Beautiful Soup and print it out in table form with the texttable library.
import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
...
def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s
...
The URL of the page I'm scraping is shown in the code; here is a sample of the HTML it returns:
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<th>Artist - Title</th>
<th>Album</th>
<th>Album Type</th>
<th>Series</th>
<th>Duration</th>
<th>Type of Play</th>
<th>
<span title="...">Time to play</span>
</th>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 1</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 1</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
5:43
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:00:00
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song2</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 2</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
6:16
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:05:43
</td>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 3</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 3</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
4:13
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:11:59
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song 4</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 4</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
5:34
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:16:12
</td>
</tr>
<tr>
<td class="row1"><a href="..." class="songinfo">Song 5</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 5</a>
</td>
<td class="row1">...</td>
<td class="row1"></td>
<td class="row1" style="text-align: center">
4:23
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:21:46
</td>
</tr>
<tr>
<td style="height: 5px;">
</td></tr>
<tr>
<td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
</tr>
</tbody>
</table>
Whenever I run this function, it aborts with "TypeError: find() takes no keyword arguments" on the line header_append = th.find(text=True). I'm stumped: as far as I can tell I'm doing what the code examples show, and it should work, yet it doesn't.
In short, how do I fix the code so that there is no TypeError and what am I doing wrong?
Edit: Articles and documentation that I referred to when writing the script:
- http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/
- http://oneau.wordpress.com/2010/05/30/simple-formatted-tables-in-python-with-texttable/
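Answer 1:
The Basic Issue
The parser is behaving correctly; the loops are applying Tag expressions to objects that are not Tags. In the header loop, the extracted text is appended to header (the list being iterated) instead of header_text, so the loop eventually reaches a string and th.find(text=True) becomes a string find. Likewise, stable.find('tr') and tr.find('td') each return a single element, where findAll is needed to get a list to loop over.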
Revised code
Here is a snippet, focusing only on returning scraped lists. Once you have the lists, you can format the text table easily:
import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    # Returns:
    #   (headers, cells, footer)
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')
    # findAll returns a list of Tags, one per <th>
    header = stable.findAll('th')
    headers = [th.text for th in header]
    cells = []
    rows = stable.findAll('tr')
    # Skip the header row and the last two rows (spacer and footer)
    for tr in rows[1:-2]:
        # Process the body of the table
        row = []
        tds = tr.findAll('td')
        # The first two cells wrap their text in a link
        row.append(tds[0].find('a').text)
        row.append(tds[1].find('a').text)
        # The remaining cells hold plain text
        row.extend([td.text for td in tds[2:]])
        cells.append(row)
    # The last row is the summary line
    footer = rows[-1].find('td').text
    return headers, cells, footer
Output
headers, cells, and footer can now be fed into a texttable formatting function:
import texttable

def show_table(headers, cells, footer):
    retval = ''
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

print show_table(headers, cells, footer)
+----------+----------+----------+----------+----------+----------+----------+
| Artist - |  Album   |  Album   |  Series  | Duration | Type of  | Time to  |
|  Title   |          |   Type   |          |          |   Play   |   play   |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1   | Album 1  | ...      |          | 5:43     | S.A.M.   | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2    | Album 2  | ...      |          | 6:16     | S.A.M.   | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3   | Album 3  | ...      |          | 4:13     | S.A.M.   | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4   | Album 4  | ...      |          | 5:34     | S.A.M.   | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5   | Album 5  | ...      |          | 4:23     | S.A.M.   | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.
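The slice rows[1:-2] in get_queue is what separates the song rows from the rest of the table. Here is a minimal sketch of that step using plain lists in place of parsed rows; the rows list is made-up illustration data (an assumption, not real scraper output), laid out like the sample HTML: one header row, the song rows, a spacer row, then the summary row.

```python
# Hypothetical stand-in for the parsed <tr> rows, in document order.
rows = [
    ['Artist - Title', 'Album'],      # header row (<th> cells)
    ['Song 1', 'Album 1'],            # song rows
    ['Song 2', 'Album 2'],
    [''],                             # spacer row (<td style="height: 5px;">)
    ['There are x songs in the queue with a total length of x:y:z.'],
]

headers = rows[0]        # first row holds the column titles
body = rows[1:-2]        # drop the header row and the last two rows
footer = rows[-1][0]     # the last row is the one-cell summary line
```

If the page ever drops the spacer row, the -2 would cut off a real song row, so this slice is tied to the layout shown in the question.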
Answer 2:
You're getting the error TypeError: find() takes no keyword arguments because you are actually calling find() on a string.
string find
find is a Python string method that takes no keyword arguments. Example:
>>> 'hello'.find('l')
2
>>> 'hello'.find('l', foo='bar')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: find() takes no keyword arguments
beautifulsoup find
BeautifulSoup's Tag also has a find method, which does accept keyword arguments such as text=True; that is the one you were trying to use.
The bottom line
At some point in your code, you ended up calling the string find, when you wanted to be working with a Tag.
Python uses duck typing, which can cause confusion in cases like this.
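To make that concrete, here is a rough sketch of how a loop like the one in the question can end up calling the string find. FakeTag is a hypothetical stand-in written for this example (it is not BeautifulSoup's Tag class), and collect_buggy / collect_fixed are illustrative names; the only assumption is that the tag's find(text=True) hands back a string.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed <th> element."""
    def __init__(self, label):
        self.label = label

    def find(self, name=None, text=None):
        # Like a tag lookup with text=True, hand back the text as a string.
        return self.label


def collect_buggy(tags):
    header = list(tags)
    header_text = []
    for th in header:
        # Bug: appends to the list being iterated, so the loop later
        # reaches the appended strings and calls str.find(text=True).
        header.append(th.find(text=True))
    return header_text


def collect_fixed(tags):
    header_text = []
    for th in tags:
        # Fix: collect the text in a separate list.
        header_text.append(th.find(text=True))
    return header_text


tags = [FakeTag('Album'), FakeTag('Series')]

try:
    collect_buggy(tags)
    error_message = None
except TypeError as exc:
    error_message = str(exc)   # the "takes no keyword arguments" error

fixed = collect_fixed(tags)
```

A quick way to spot this kind of mix-up in real code is to print type(th) just before the failing line.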
Source: https://stackoverflow.com/questions/11756006/find-on-beautiful-soup-in-loop-returns-typeerror