Convert <br> to end line

别跟我提以往 2020-12-08 19:00

I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.

My problem is that the text contains <br> tags, and I need to convert them to line endings.

6 Answers
  • 2020-12-08 19:28

    Adding to Ian's and dividebyzero's posts and comments, you can do this to filter and replace many tags efficiently in one go:

    for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
        elem.replace_with(elem.text + "\n\n")
    
  • 2020-12-08 19:35

    A regex should do the trick.

    import re
    # raw string avoids an invalid-escape warning; /? also matches <br/> and <br />
    s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
    

    Hope this helps!
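
As a rough sketch of how the regex approach could fit into a full pipeline: the substitution runs on the raw HTML string before parsing, so BeautifulSoup only ever sees newlines. The sample markup and the names `BR_PATTERN` and `br_to_newline` are illustrative assumptions, and the pattern is widened here to accept `<br/>`, `<br />`, and uppercase variants.

```python
import re
from bs4 import BeautifulSoup

# Matches <br>, <br/>, <br /> in any letter case.
BR_PATTERN = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def br_to_newline(html):
    """Replace every <br> variant in a raw HTML string with a newline."""
    return BR_PATTERN.sub('\n', html)

# Hypothetical sample markup, for illustration only.
sample_html = '<p>a<br>b<br/>c<BR />d</p>'
regex_text = BeautifulSoup(br_to_newline(sample_html), 'html.parser').get_text()
print(regex_text)
```

Note that a regex over raw HTML cannot tell markup from text, so it would also rewrite a `<br>` that appears inside a script or a comment.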

  • 2020-12-08 19:39

    As the official documentation says:

    You can specify a string to be used to join the bits of text together: soup.get_text("\n")
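
One thing worth checking with this approach: the separator is placed between every text node, not only where a `<br>` sits, so inline tags such as `<b>` also produce breaks. A minimal sketch, assuming a made-up snippet of markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one inline <b> tag and one <br>.
soup = BeautifulSoup('<p>Hello <b>big</b> world<br>next line</p>', 'html.parser')

# The separator is inserted between all adjacent text nodes.
joined_text = soup.get_text('\n')

# strip=True additionally trims whitespace around each text node.
stripped_text = soup.get_text('\n', strip=True)

print(repr(joined_text))
print(repr(stripped_text))
```

If only the `<br>` positions should become line breaks, replacing the tags before calling get_text() may be the better fit.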

  • 2020-12-08 19:39

    If you call element.text, you get the text without the br tags. You may need to define your own helper for this purpose:

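    from bs4 import BeautifulSoup

    def clean_text(elem):
        """Collect the text under elem, adding a newline at each <br> and <p>."""
        text = ''
        for e in elem.descendants:
            if isinstance(e, str):
                # NavigableString subclasses str, so this matches text nodes
                text += e.strip()
            elif e.name in ('br', 'p'):
                text += '\n'
        return text

    # get page content (request_response is assumed to be an HTTP response obtained earlier)
    soup = BeautifulSoup(request_response.text, 'html.parser')
    # get your target element
    description_div = soup.select_one('.description-class')
    # clean the data
    print(clean_text(description_div))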
    
  • 2020-12-08 19:46

    Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.

    To steal the list from @petezurich:

    for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
        elem.append('\n')
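
A small sketch of what appending buys you, assuming made-up sample markup and a shortened tag list: the newlines show up in get_text(), while the original tags stay in the tree for any later lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
soup = BeautifulSoup('<div><p>first</p><p>second<br>third</p></div>', 'html.parser')

# Append a newline inside each tag of interest instead of replacing the tag.
for elem in soup.find_all(['p', 'br']):
    elem.append('\n')

appended_text = soup.get_text()
remaining_paragraphs = len(soup.find_all('p'))

print(repr(appended_text))
print(remaining_paragraphs)
```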
    
  • 2020-12-08 19:51

    You can do this using the BeautifulSoup object itself, or any element of it:

    for br in soup.find_all("br"):
        br.replace_with("\n")
    