Extracting selected columns from a table using BeautifulSoup

情话喂你 2020-12-05 15:48

I am trying to extract the first and third columns of this data table using BeautifulSoup. From looking at the HTML, the first column has a <th> tag. The o

3 Answers
  • 2020-12-05 16:07

    You can try this code:

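    # Python 3 with the bs4 package (the successor to the old BeautifulSoup module)
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    
    # Walk every row in the body of the first table on the page
    for row in soup.find_all('table')[0].tbody.find_all('tr'):
        first_column = row.find_all('th')[0].contents  # the row's th cell
        third_column = row.find_all('td')[2].contents  # the row's third td cell
        print(first_column, third_column)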
    

    As you can see, the code fetches the HTML from the URL, then BeautifulSoup finds the first table and loops over all of its 'tr' rows. For each row it selects the first column, which is the 'th' cell, and the third column, which is a 'td' cell.
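
    The loop above raises an IndexError on any row that lacks a 'th' cell or has fewer than three 'td' cells (a spacer or header row, for example). A minimal sketch of a more defensive variant follows, assuming the bs4 package is installed. The inline SAMPLE_HTML table and the extract_columns helper are hypothetical stand-ins for illustration and are not taken from the real page or from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>State</th><td>a</td><td>b</td><td>c</td></tr>
    <tr><td colspan="4">spacer row without a th cell</td></tr>
    <tr><th>Alabama</th><td>1</td><td>2</td><td>3</td></tr>
  </tbody>
</table>
"""


def extract_columns(html, td_index=2):
    """Return (th text, td[td_index] text) pairs for rows that have both."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    pairs = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that do not have the cells we need.
        if header is None or len(cells) <= td_index:
            continue
        pairs.append((header.get_text(strip=True),
                      cells[td_index].get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for first, third in extract_columns(SAMPLE_HTML):
        print(first, third)
```

    Iterating over table.find_all("tr") instead of going through .tbody also sidesteps the case where the markup has no tbody element, since html.parser does not insert one.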

  • 2020-12-05 16:22

    In addition to @jonhkr's answer I thought I'd post an alternate solution I came up with.

    #!/usr/bin/env python3
    
    from bs4 import BeautifulSoup
    from sys import argv
    
    filename = argv[1]
    # Read the saved HTML file into a string
    with open(filename, 'r') as f:
        html_doc = f.read()
    soup = BeautifulSoup(html_doc, 'html.parser')
    table = soup.find_all('table')[0].tbody
    
    # For each row, list all of its text nodes and keep the ones at index 1 and 5
    rows = (row.find_all(string=True) for row in table.find_all('tr'))
    data = [(texts[1], texts[5]) for texts in rows]
    print(data)
    

    Unlike jonhkr's answer, which downloads the webpage, mine assumes that you have it saved on your computer and pass its path as a command line argument. For example:

    python file.py table.html 
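
    The indices 1 and 5 in the snippet above count every text node in a row, including the whitespace-only nodes between tags, so they depend on how the saved file is laid out. A small sketch of that effect follows, assuming the bs4 package. The ROW_HTML string is a made-up row and is not taken from the real table.

```python
from bs4 import BeautifulSoup

# Hypothetical row, written with a newline between cells.
ROW_HTML = "<tr>\n<th>Alabama</th>\n<td>1</td>\n<td>2</td>\n</tr>"

row = BeautifulSoup(ROW_HTML, "html.parser").tr

# Every text node in the row, including the whitespace between tags.
all_text = row.find_all(string=True)

# Only the non-blank text, with surrounding whitespace stripped.
clean_text = list(row.stripped_strings)

print(all_text)
print(clean_text)
```

    With this layout, all_text[1] and all_text[5] are the first and third cells, while in clean_text the same cells sit at positions 0 and 2. The stripped_strings form is less sensitive to how the file is indented.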
    
  • 2020-12-05 16:29

    You can also try this code:

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('table')[0].tbody.find_all('tr'):
        first_column = row.find_all('th')[0].contents
        third_column = row.find_all('td')[2].contents
        print(first_column, third_column)
    