Extracting selected columns from a table using BeautifulSoup

前端未结

关注

 3  1456

I am trying to extract the first and third columns of this data table using BeautifulSoup. From looking at the HTML the first column has a tag. The o

相关标签:

3条回答

我寻月下人不归

2020-12-05 16:07

You can try this code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

As you can see the code just connects to the url and gets the html, and the BeautifulSoup finds the first table, then all the 'tr' and selects the first column, which is the 'th', and the third column, which is a 'td'.

0 讨论(0)

醉梦人生

2020-12-05 16:22

In addition to @jonhkr's answer I thought I'd post an alternate solution I came up with.

 #!/usr/bin/python

 from BeautifulSoup import BeautifulSoup
 from sys import argv

 filename = argv[1]
 #get HTML file as a string
 html_doc = ''.join(open(filename,'r').readlines())
 soup = BeautifulSoup(html_doc)
 table = soup.findAll('table')[0].tbody

 data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
 print data

Unlike jonhkr's answer, which dials into the webpage, mine assumes that you have it save on your computer and pass it as a command line argument. For example:

python file.py table.html

0 讨论(0)

一生所求

2020-12-05 16:29

you can try this code also

import requests
from bs4 import BeautifulSoup
page =requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print (first_column, third_column)

0 讨论(0)