I am trying to extract the first and third columns of this data table using BeautifulSoup. From looking at the HTML, the first column has a 'th' tag. The o…

You can try the first script below. As you can see, the code connects to the URL and fetches the HTML; BeautifulSoup then finds the first table, iterates over all of its 'tr' rows, and selects the first column, which is the 'th', and the third column, which is a 'td'.

In addition to @jonhkr's answer, I thought I'd post an alternate solution I came up with (the second script below). Unlike jonhkr's answer, which dials into the webpage, mine assumes that you have the page saved on your computer and pass its filename as a command-line argument; an example invocation follows the script.

You can also try the third script below.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
from sys import argv
filename = argv[1]
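# Read the saved HTML file into a single string
with open(filename, 'r') as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody
# Collect the text nodes at indexes 1 and 5 of each row
# (whitespace between tags counts as a text node too)
data = []
for row in table.findAll('tr'):
    texts = row.findAll(text=True)
    data.append((texts[1], texts[5]))
print data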
python file.py table.html
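
The indexes 1 and 5 in the script above count text nodes, not cells, and whitespace between tags is a text node as well. A minimal sketch of that effect, assuming the newer bs4 package is installed and using a made-up one-row table (the HTML string and its values are hypothetical, not taken from the SAMHSA page):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table; the newlines between tags become text nodes.
html_doc = (
    "<table><tbody><tr>\n"
    "<th>Alabama</th>\n"
    "<td>1.0</td>\n"
    "<td>2.0</td>\n"
    "</tr></tbody></table>"
)

soup = BeautifulSoup(html_doc, "html.parser")
row = soup.find("tr")

# Every text node in the row, including whitespace-only ones.
texts = row.find_all(string=True)

# Dropping whitespace-only nodes leaves one entry per cell.
cells = [t.strip() for t in texts if t.strip()]

print(texts)
print(cells)
```

Because the positions depend on the exact whitespace in the file, selecting cells by tag, as the other answers do, is likely to be less brittle.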
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print(first_column, third_column)
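
The loops above print .contents, which is a list of child nodes, and they raise IndexError on any row that lacks a 'th' or a third 'td'. As a sketch under assumptions (the helper name extract_columns and the sample HTML are made up for illustration, and bs4 is assumed to be installed), the same selection can return plain strings and skip incomplete rows:

```python
from bs4 import BeautifulSoup


def extract_columns(html_doc):
    """Return (th text, third td text) for each complete row of the first table."""
    soup = BeautifulSoup(html_doc, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    pairs = []
    for row in table.find_all("tr"):
        headers = row.find_all("th")
        cells = row.find_all("td")
        # Skip header or spacer rows that lack the cells we index into.
        if not headers or len(cells) < 3:
            continue
        pairs.append(
            (headers[0].get_text(strip=True), cells[2].get_text(strip=True))
        )
    return pairs


# Hypothetical sample data, not taken from the SAMHSA page.
sample = """
<table>
  <tr><th>State</th><th>A</th><th>B</th><th>C</th></tr>
  <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
  <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
</table>
"""

print(extract_columns(sample))
```

Mirroring the answers above, the "third column" here is the third 'td' cell of each row; the header row is skipped because it contains no 'td' cells.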