Removing new line '\n' from the output of python BeautifulSoup

难免孤独 2021-01-12 01:52

I am using Python's Beautiful Soup to get the contents of:

abc def
3 answers
  • 2021-01-12 01:58

    If you just strip the items in breadcrum, you end up with empty items in your list. You can either do as shaktimaan suggested and then use:

    breadcrum = filter(None, breadcrum)
    

    Or you can strip them all beforehand (in html_doc):

    mystring = mystring.replace('\n', ' ').replace('\r', '')
    

    Either way, to get your string output, do something like this:

    ','.join(breadcrum)
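
To make the first option concrete, here is a minimal sketch of strip followed by filter(None, ...). The sample list is an assumption, since the question's actual data is not shown in the thread; it only imitates text nodes with newline-only entries between them.

```python
# Assumed sample data: the question's real list is not shown in the thread.
breadcrum = ["\n", "abc", "\n", "def", "\n", "ghi", "\n"]

# Stripping alone turns the newline-only items into empty strings.
stripped = [item.strip() for item in breadcrum]

# filter(None, ...) drops every falsy item, which includes "".
# In Python 3 it returns an iterator, so wrap it in list() to reuse it.
cleaned = list(filter(None, stripped))

result = ",".join(cleaned)
print(result)
```

The list() call only matters if the filtered items are needed more than once; ','.join() accepts the iterator directly.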
    
  • 2021-01-12 02:15

    Unless I'm missing something, just combine strip and list comprehension.

    Code:

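    from bs4 import BeautifulSoup as bsoup
    
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    with open("test.html", "r") as ofile:
        soup = bsoup(ofile, "html.parser")
    
    # Strip each anchor's text, then join the pieces with commas.
    res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
    print(res)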
    

    Result:

    abc,def,ghi
    [Finished in 0.2s]
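
To try the same idea without a test.html on disk, here is a minimal sketch on an inline string. The markup is an assumption (the question's HTML is not shown in the thread). It also swaps in .stripped_strings as an alternative to calling strip() by hand: this Beautiful Soup generator yields each text node with surrounding whitespace removed and skips nodes that are only whitespace.

```python
from bs4 import BeautifulSoup

# Assumed markup; it stands in for the contents of test.html.
html_doc = """
<div class="path">
    <a href="#">abc</a>
    <a href="#">def</a>
    <a href="#">ghi</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", class_="path")

# .strings yields every text node, including the whitespace between tags.
raw = list(path.strings)

# .stripped_strings strips each node and skips the whitespace-only ones.
res = ",".join(path.stripped_strings)
print(res)
```

Note that .stripped_strings collects all text under the div, not just anchor text, so it fits only when the div contains nothing but the links.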
    
  • 2021-01-12 02:17

    You could do this:

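    breadcrum = [item.strip() for item in breadcrum if item.strip()]
    

    The if item.strip() condition drops the items that are empty once the newline characters are stripped. (A test such as if str(item) would not do this, because '\n' is still a non-empty string when the condition is checked.)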

    If you want to join the strings, then do:

    ','.join(breadcrum)
    

    This will give you abc,def,ghi

    EDIT

    Although the above gives you what you want, as others in the thread pointed out, the way you are using BS to extract the anchor texts is not correct. Once you have the div you are interested in, you should use it to get its children and then get the anchor text, like this:

    path = soup.find('div',attrs={'class':'path'})
    anchors = path.find_all('a')
    data = []
    for ele in anchors:
        data.append(ele.text)
    

    And then do a ','.join(data)
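
One caveat: ele.text returns the anchor text as-is, so any newlines inside the anchors survive. The sketch below shows one way to handle that, assuming markup in which each anchor's text is wrapped in newlines (the question's HTML is not shown in the thread, so this is a stand-in). It uses get_text(strip=True), which strips each piece of text before returning it.

```python
from bs4 import BeautifulSoup

# Assumed markup: each anchor's text is wrapped in newlines.
html_doc = (
    '<div class="path">'
    '<a href="#">\nabc\n</a>'
    '<a href="#">\ndef\n</a>'
    '<a href="#">\nghi\n</a>'
    '</div>'
)

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})
anchors = path.find_all("a")

# .text keeps the newlines that sit inside each anchor.
data = [ele.text for ele in anchors]

# get_text(strip=True) strips the whitespace around each piece of text.
clean = [ele.get_text(strip=True) for ele in anchors]

print(",".join(clean))
```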
