Getting the nth element using BeautifulSoup

生来不讨喜 2021-01-31 04:09

From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Is findNextSibling and an incrementing counter the way to go?
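For reference, the sibling-walking approach the question describes could look roughly like the sketch below. It assumes bs4 with the built-in html.parser; the sample table and the helper name every_nth_row_by_siblings are made up for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: a one-column table with 12 rows
html = "<table>" + "".join(f"<tr><td>row{i}</td></tr>" for i in range(1, 13)) + "</table>"
soup = BeautifulSoup(html, "html.parser")

def every_nth_row_by_siblings(table, n):
    """Walk <tr> siblings with a counter and keep rows n, 2n, 3n, ..."""
    picked = []
    row = table.find("tr")  # first row in the table
    count = 1
    while row is not None:
        if count % n == 0:
            picked.append(row)
        row = row.find_next_sibling("tr")  # None once the rows run out
        count += 1
    return picked

rows = every_nth_row_by_siblings(soup.table, 5)
texts = [r.get_text() for r in rows]
print(texts)
```

This works, but it visits every row one call at a time; the answers below reach the same rows with less bookkeeping.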

5 answers
  • 2021-01-31 04:24

    You could also use findAll to get all the rows in a list and then use slice syntax to pick out the elements you need:

    rows = soup.findAll('tr')[4::5]
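
    A self-contained sketch of the slice approach, assuming bs4 (where findAll is spelled find_all) and a made-up ten-row table. The slice [4::5] starts at index 4, which is row 5, and then steps by 5:

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: ten rows, foo1 .. foo10
html = "<table>" + "".join(f"<tr><td>foo{i}</td></tr>" for i in range(1, 11)) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# Index 4 is the 5th row; a step of 5 then yields rows 10, 15, ...
rows = soup.find_all("tr")[4::5]
texts = [row.get_text(strip=True) for row in rows]
print(texts)
```

    If the page holds several tables, it is probably safer to call find_all on the specific table element rather than on the whole soup, so that rows from other tables do not shift the count.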
    
  • 2021-01-31 04:24

    Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:

    from gazpacho import Soup
    
    url = "https://en.wikipedia.org/wiki/List_of_probability_distributions"
    soup = Soup.get(url)
    
    a_tags = soup.find("a", {"href": "distribution"})
    links = ["https://en.wikipedia.org" + a.attrs["href"] for a in a_tags]
    
    links[4::5]  # start at index 4 (the 5th link) and stride by 5
    
  • 2021-01-31 04:35

    As a general solution, you can convert the table to a nested list and iterate...

    from bs4 import BeautifulSoup
    
    def listify(table):
        """Convert an HTML table to a nested list of cell strings."""
        result = []
        for row in table.find_all('tr'):
            cells = []
            for col in row.find_all('td'):
                # Join every text node inside the cell into one string
                cells.append(''.join(col.find_all(string=True)))
            result.append(cells)
        return result
    
    if __name__ == "__main__":
        # Build a small table with one column and ten rows, then parse it into a list
        htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr>  <tr> <td>foo6</td> </tr>  <tr> <td>foo7</td> </tr>  <tr> <td>foo8</td> </tr>  <tr> <td>foo9</td> </tr>  <tr> <td>foo10</td> </tr></table>"""
        soup = BeautifulSoup(htstring, 'html.parser')
        for idx, row in enumerate(listify(soup), start=1):
            # Keep only rows 5, 10, 15, ...
            if idx % 5 == 0:
                print(row)
    

    Running that...

    [mpenning@Bucksnort ~]$ python testme.py
    ['foo5']
    ['foo10']
    [mpenning@Bucksnort ~]$
    
  • 2021-01-31 04:41

    This is easy to do with select in Beautiful Soup if you know which row numbers you want. (Note: this requires bs4.)

    row = 5
    while True:
        # select() returns a list of every <tr> that is the row-th <tr> among its siblings
        element = soup.select(f'tr:nth-of-type({row})')
        if len(element) > 0:
            # element holds your desired row element(s); do what you want with it
            row += 5
        else:
            break
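
    The loop above runs one query per row number. As a possible shortcut, the nth-of-type selector also accepts the an+b form, so a single select call with 5n should return rows 5, 10, 15, ... in one go. This is a sketch assuming bs4 with its bundled selector engine (Soup Sieve) and a made-up 12-row table:

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: a one-column table with 12 rows
html = "<table>" + "".join(f"<tr><td>row{i}</td></tr>" for i in range(1, 13)) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# 5n matches every <tr> whose position among its sibling <tr>s is a multiple of 5
rows = soup.select("tr:nth-of-type(5n)")
texts = [row.get_text() for row in rows]
print(texts)
```

    Note that the position is counted per parent element, so with several tables on the page this picks every 5th row of each table.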
    
  • 2021-01-31 04:42

    Another option, if you prefer the raw HTML of each row...

    from bs4 import BeautifulSoup
    
    # Build a small table with one column and ten rows, then keep every 5th <tr>
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr>  <tr> <td>foo6</td> </tr>  <tr> <td>foo7</td> </tr>  <tr> <td>foo8</td> </tr>  <tr> <td>foo9</td> </tr>  <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup(htstring, 'html.parser')
    result = [html_tr for idx, html_tr in enumerate(soup.find_all('tr'), start=1)
              if idx % 5 == 0]
    print(result)
    

    Running that...

    [mpenning@Bucksnort ~]$ python testme.py
    [<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
    [mpenning@Bucksnort ~]$
    