Beautiful Soup [Python]: extracting text from a table

不思量自难忘° 2021-02-14 12:33

I am new to Python and to Beautiful Soup as well. I have heard that Beautiful Soup is a great tool for parsing and extracting content, so here I am:

I want to take the content

3 answers
  • 2021-02-14 13:02

    Use the "text" attribute to get the text inside each "td".

    1) First, locate the table in the DOM by tag or ID:

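    from bs4 import BeautifulSoup
    # self.driver is presumably a Selenium WebDriver that has already loaded the page
    soup = BeautifulSoup(self.driver.page_source, "html.parser")
    htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})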
    

    2) Get the tbody inside that table:

    tbody = htnm_migration_table.find('tbody')
    

    3) Find all tr elements inside the tbody:

    trs = tbody.find_all('tr')
    

    4) Loop over each tr and print the text of its td cells:

    for tr in trs:
        tds = tr.find_all('td')
        for td in tds:
            print(td.text)
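
The four steps above can be sketched as one self-contained script. This is only an illustration: the HTML string is a made-up stand-in for driver.page_source, the helper name table_rows is hypothetical, and the fallback for tables without a tbody is an assumption about what you would want in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for driver.page_source.
HTML = """
<table id="htnm_migration_table">
  <tbody>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </tbody>
</table>
"""

def table_rows(html, table_id):
    """Return the table's cell text as a list of rows (lists of strings)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        # No table with that id: nothing to extract.
        return []
    # Fall back to the table itself when the markup has no <tbody>.
    body = table.find("tbody") or table
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in body.find_all("tr")
    ]

rows = table_rows(HTML, "htnm_migration_table")
for row in rows:
    print(row)
```

Collecting the cells into a list of rows, rather than printing them one by one, keeps the table structure available for later processing.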
    
  • 2021-02-14 13:14

    I find Beautiful Soup a very efficient tool, so keep learning it :-) It can parse pages with invalid markup, so it should be able to handle the page you refer to. You may want to use BeautifulSoup(html).prettify() if you want a reformatted page source with valid markup.
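
To illustrate the point about invalid markup, here is a minimal sketch. It assumes the bs4 package with the built-in "html.parser" backend; the broken HTML string is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: none of the tags are closed.
broken_html = "<table><tr><td>This is a sample text"

soup = BeautifulSoup(broken_html, "html.parser")

# prettify() re-serializes the parsed tree, so the missing
# closing tags appear in the output.
pretty = soup.prettify()
print(pretty)
```

Other parser backends (lxml, html5lib) may repair the same input differently, so the exact output depends on which parser you pass.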

    As for your question, the result of soup.find(...) is itself a Beautiful Soup Tag object, so you can run a second search on it. Note that findAll(...) returns a list of matches instead, so it has no find method of its own. For example:

    table_soup = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
    your_sample_text = table_soup.find("td").renderContents().strip()
    
    print(your_sample_text)
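
One detail worth checking when chaining searches: find_all (spelled findAll in older releases) returns a list-like ResultSet, while find returns a single tag. The sketch below shows both on invented markup with two tables; only the class name comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables sharing the class from the question.
HTML = """
<table class="bp_ergebnis_tab_info"><tr><td> first table </td></tr></table>
<table class="bp_ergebnis_tab_info"><tr><td> second table </td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all returns every match; search inside each one individually.
tables = soup.find_all("table", attrs={"class": "bp_ergebnis_tab_info"})
first_cells = [t.find("td").get_text(strip=True) for t in tables]
print(first_cells)

# find returns only the first match, which can be searched directly.
single = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
single_text = single.find("td").get_text(strip=True)
print(single_text)
```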
    
  • 2021-02-14 13:20

    First find the table (as you are doing). Using find rather than findAll returns the first match directly instead of a list of all matches, so there is no need to add [0] to take the first element:

    table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
    

    Then use find again to find the first td:

    first_td = table.find('td')
    

    Then use renderContents() to extract the textual contents:

    text = first_td.renderContents()
    

    ... and the job is done (though you may also want to use strip() to remove leading and trailing whitespace):

    trimmed_text = text.strip()
    

    This should give:

    >>> print trimmed_text
    This is a sample text
    >>>
    

    as desired.
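
The snippets above are written for Python 2 and the older Beautiful Soup 3 API. As far as I know, renderContents() survives in bs4 only as a legacy method and returns bytes on Python 3. A rough bs4 equivalent might look like the sketch below; the HTML is invented, and the nested b tag is there only to show the difference between the inner markup and the plain text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name comes from the question.
HTML = """
<table class="bp_ergebnis_tab_info">
  <tr><td>  This is a <b>sample</b> text  </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
first_td = table.find("td")

# decode_contents() returns the cell's inner markup as a str.
inner_markup = first_td.decode_contents().strip()

# get_text() drops the tags and returns only the text.
plain_text = first_td.get_text().strip()

print(inner_markup)
print(plain_text)
```

Calling strip() on the result of get_text() trims only the outer whitespace, which keeps the spaces around nested tags intact.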
