How do you get all the rows from a particular table using BeautifulSoup?

梦谈多话 2020-12-24 07:29

I am learning Python and BeautifulSoup to scrape data from the web, and I want to read an HTML table. I can read it into Open Office, which says that it is Table #11.

It seems

2 Answers
  • 2020-12-24 08:26

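    If you ever have nested tables (as on old-school websites), the approach in the other answer might fail, because a recursive search on the outer table also picks up the rows and cells of the tables nested inside it.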

    As a solution, you might want to extract non-nested tables first:

    html = '''<table>
    <tr>
    <td>Top level table cell</td>
    <td>
        <table>
        <tr><td>Nested table cell</td></tr>
        <tr><td>...another nested cell</td></tr>
        </table>
    </td>
    </tr>
    </table>'''
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
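
    Note that the comprehension above keeps the tables that contain no other table, that is, the innermost ones. If you want the opposite (tables that do not sit inside another table), something along these lines might work. This is only a sketch: top_level_tables and innermost_tables are hypothetical helper names, the id attributes are added purely for illustration, and html.parser is used so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Same shape as the HTML above; the id attributes are illustrative additions.
HTML = """<table id="outer">
<tr>
<td>Top level table cell</td>
<td>
    <table id="inner">
    <tr><td>Nested table cell</td></tr>
    </table>
</td>
</tr>
</table>"""


def top_level_tables(soup):
    """Return tables that have no <table> ancestor (hypothetical helper)."""
    return [t for t in soup.find_all("table") if t.find_parent("table") is None]


def innermost_tables(soup):
    """Return tables that contain no other <table> (hypothetical helper)."""
    return [t for t in soup.find_all("table") if t.find("table") is None]


soup = BeautifulSoup(HTML, "html.parser")
top_ids = [t["id"] for t in top_level_tables(soup)]
inner_ids = [t["id"] for t in innermost_tables(soup)]
print(top_ids)    # ['outer']
print(inner_ids)  # ['inner']
```

    Which of the two filters you need depends on whether the data lives in the wrapping layout table or in the tables inside it.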
    

    Alternatively, if you want to extract the content of all the tables, including those that nest other tables, you can extract only the top-level tr rows and th/td cells of each table. For this, you need to turn off recursion when calling the find_all method:

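    soup = BeautifulSoup(html, 'lxml')
    tables = soup.find_all('table')
    for cnt, my_table in enumerate(tables, start=1):
        print('=============== TABLE {} ==============='.format(cnt))
        # recursive=False: only rows that are direct children of this table
        rows = my_table.find_all('tr', recursive=False)                  # <-- HERE
        for row in rows:
            # recursive=False: only cells that are direct children of this row
            cells = row.find_all(['th', 'td'], recursive=False)          # <-- HERE
            for cell in cells:
                # .string is None for a cell with several child nodes, such as
                # the cell wrapping the nested table (whitespace plus <table>)
                if cell.string:
                    print(cell.string)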
    

    Output:

    =============== TABLE 1 ===============
    Top level table cell
    =============== TABLE 2 ===============
    Nested table cell
    ...another nested cell
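
    One caveat with recursive=False: rows are direct children of the table only when the markup has no thead/tbody/tfoot wrappers and the parser does not add them (html5lib inserts tbody, for example). A possible way to cover both cases is sketched below. The names direct_rows, cell_text and table_to_lists are hypothetical helpers, not part of BeautifulSoup, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup

# Made-up sample: explicit <thead>/<tbody> plus a nested table in one cell.
HTML = """<table>
<thead><tr><th>name</th><th>detail</th></tr></thead>
<tbody>
<tr><td>outer</td><td><table><tr><td>inner</td></tr></table></td></tr>
</tbody>
</table>"""


def direct_rows(table):
    """Rows belonging to this table itself, in document order."""
    rows = []
    wanted = ["tr", "thead", "tbody", "tfoot"]
    for child in table.find_all(wanted, recursive=False):
        if child.name == "tr":
            rows.append(child)
        else:
            # Step through the section wrapper, still without recursion.
            rows.extend(child.find_all("tr", recursive=False))
    return rows


def cell_text(cell):
    """Cell text, or a placeholder when the cell wraps a nested table."""
    if cell.find("table") is not None:
        return "<nested table>"
    return cell.get_text(strip=True)


def table_to_lists(table):
    """One list of cell texts per top-level row of the table."""
    return [
        [cell_text(c) for c in row.find_all(["th", "td"], recursive=False)]
        for row in direct_rows(table)
    ]


soup = BeautifulSoup(HTML, "html.parser")
outer, inner = soup.find_all("table")
outer_rows = table_to_lists(outer)
inner_rows = table_to_lists(inner)
print(outer_rows)  # [['name', 'detail'], ['outer', '<nested table>']]
print(inner_rows)  # [['inner']]
```

    Returning lists instead of printing makes it easier to pass the rows on to csv or another tool later.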
    
  • 2020-12-24 08:29

    This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the find_all method (findChildren is an older alias for it), then read the text value inside each cell with the string property.

    >>> from bs4 import BeautifulSoup
    >>>
    >>> html = """
    ... <html>
    ... <body>
    ...     <table>
    ...         <tr><th>column 1</th><th>column 2</th></tr>
    ...         <tr><td>value 1</td><td>value 2</td></tr>
    ...     </table>
    ... </body>
    ... </html>
    ... """
    >>>
    >>> # Name the parser explicitly; html.parser ships with Python.
    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> tables = soup.find_all('table')
    >>>
    >>> # This will get the first (and only) table. Your page may have more.
    >>> my_table = tables[0]
    >>>
    >>> rows = my_table.find_all('tr')
    >>>
    >>> for row in rows:
    ...     # You can match several tag names by passing a list of strings
    ...     cells = row.find_all(['th', 'td'])
    ...     for cell in cells:
    ...         value = cell.string
    ...         print("The value in this cell is %s" % value)
    ... 
    The value in this cell is column 1
    The value in this cell is column 2
    The value in this cell is value 1
    The value in this cell is value 2
    >>> 
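
    To tie this back to the question: if the table you want is the 11th one on the page, indexing the result of find_all is one option. The following is only a sketch. nth_table_rows is a hypothetical helper, the sample page is made up, and it assumes Open Office numbers tables from 1 in document order, which I have not verified. It uses get_text() rather than .string, because .string is None when a cell has more than one child node (for example text plus a b tag).

```python
from bs4 import BeautifulSoup


def nth_table_rows(html, n):
    """Return the rows of the n-th table (1-based) as lists of cell text."""
    tables = BeautifulSoup(html, "html.parser").find_all("table")
    if not 1 <= n <= len(tables):
        raise IndexError("page has %d table(s), asked for #%d" % (len(tables), n))
    table = tables[n - 1]
    return [
        # Join text fragments with a space and strip each of them.
        [cell.get_text(" ", strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


# Made-up sample page with two tables; a real page would be downloaded first.
SAMPLE = """
<table><tr><td>first table</td></tr></table>
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value <b>1</b></td><td>value 2</td></tr>
</table>
"""

second = nth_table_rows(SAMPLE, 2)
print(second)  # [['column 1', 'column 2'], ['value 1', 'value 2']]
```

    Keep in mind that find_all counts nested tables too, so on a page with nested tables the index may not match what another program reports.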
    