Python convert html to text and mimic formatting

挽巷 2020-12-31 13:56

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

  • One
4 Answers
  • 2020-12-31 14:10

    I have code for a simpler task: remove the HTML tags and insert newlines at the appropriate places. Maybe this can be a starting point for you.

    Python's textwrap module might be helpful for creating indented blocks of text.

    http://docs.python.org/2/library/textwrap.html

    import re
    from html import unescape  # Python 3; converts HTML entities to Unicode

    class HtmlTool(object):
        """
        Algorithms to process HTML.
        """
        # Regular expressions to recognize different parts of HTML.
        # Internal style sheets or JavaScript
        script_sheet = re.compile(r"<(script|style).*?>.*?(</\1>)",
                                  re.IGNORECASE | re.DOTALL)
        # HTML comments - can contain ">"
        comment = re.compile(r"<!--(.*?)-->", re.DOTALL)
        # HTML tags: <any-text>
        tag = re.compile(r"<.*?>", re.DOTALL)
        # Consecutive whitespace characters
        nwhites = re.compile(r"[\s]+")
        # <p>, <div>, <br> tags and associated closing tags
        p_div = re.compile(r"</?(p|div|br).*?>",
                           re.IGNORECASE | re.DOTALL)
        # Consecutive whitespace, but no newlines
        nspace = re.compile(r"[^\S\n]+", re.UNICODE)
        # At least two consecutive newlines
        n2ret = re.compile(r"\n\n+")
        # A newline followed by a space
        retspace = re.compile(r"(\n )")

        @staticmethod
        def to_nice_text(html):
            """Remove all HTML tags, but produce a nicely formatted text."""
            if html is None:
                return ""
            text = str(html)
            text = HtmlTool.script_sheet.sub("", text)
            text = HtmlTool.comment.sub("", text)
            text = HtmlTool.nwhites.sub(" ", text)
            text = HtmlTool.p_div.sub("\n", text)  # convert <p>, <div>, <br> to "\n"
            text = HtmlTool.tag.sub("", text)      # remove all remaining tags
            text = unescape(text)                  # convert HTML entities to Unicode
            # Get whitespace right
            text = HtmlTool.nspace.sub(" ", text)
            text = HtmlTool.retspace.sub("\n", text)
            text = HtmlTool.n2ret.sub("\n\n", text)
            text = text.strip()
            return text
    

    There might be some superfluous regexes left in the code.
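
    As for the textwrap suggestion, here is a minimal sketch of how an indented block could be produced. The helper name `indent_block` and its `width` and `prefix` parameters are made up for illustration; only `textwrap.fill` and its `initial_indent` / `subsequent_indent` arguments come from the standard library.

```python
import textwrap

def indent_block(text, width=40, prefix="    "):
    """Wrap `text` to at most `width` columns and indent every line with `prefix`.

    Hypothetical helper for illustration; the width includes the prefix.
    """
    return textwrap.fill(text, width=width,
                         initial_indent=prefix,
                         subsequent_indent=prefix)

# Render a blockquote-like paragraph between two plain lines.
print("Some text")
print()
print(indent_block("More magnificent text here", width=20))
print()
print("Final text")
```

    This could be applied to the text of a blockquote once the tags have been stripped, although the class above does not keep track of which text came from which tag.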

  • 2020-12-31 14:30

    Python's built-in html.parser (HTMLParser in earlier versions) module can be easily extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.

    Because it is so simple, you can't navigate around the HTML tree as you could with Beautiful Soup (e.g. sibling, child and parent nodes), but for a simple case like yours it should be enough.

    html.parser homepage

    In your case you could use it like this, adding the appropriate formatting whenever a start tag or end tag of a specific type is encountered:

    from html.parser import HTMLParser
    from os import linesep
    
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            # The `strict` argument was removed in Python 3.5, so it is not passed here.
            HTMLParser.__init__(self)
        def feed(self, in_html):
            self.output = ""
            super(MyHTMLParser, self).feed(in_html)
            return self.output
        def handle_data(self, data):
            self.output += data.strip()
        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                self.output += linesep + '* '
            elif tag == 'blockquote':
                self.output += linesep + linesep + '\t'
        def handle_endtag(self, tag):
            if tag == 'blockquote':
                self.output += linesep + linesep
    
    parser = MyHTMLParser()
    content = "<ul><li>One</li><li>Two</li></ul>"
    print(linesep + "Example 1:")
    print(parser.feed(content))
    content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
    print(linesep + "Example 2:")
    print(parser.feed(content))
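
    Since the parser only reports events, any context you need (such as how deeply a list is nested) has to be tracked by hand. The following is a rough sketch of that idea, not part of the original answer: the class `NestedListParser`, the function `nested_list_to_text` and the two-space indent per level are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

class NestedListParser(HTMLParser):
    """Hypothetical variant: indent <li> items by their <ul>/<ol> nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of lists we are currently inside
        self.parts = []   # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            # Two spaces of indentation for each level below the outermost list.
            indent = "  " * max(self.depth - 1, 0)
            self.parts.append("\n" + indent + "* ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1

    def handle_data(self, data):
        self.parts.append(data.strip())

    def text(self):
        return "".join(self.parts).strip()

def nested_list_to_text(markup):
    """Parse `markup` with a fresh parser and return the formatted text."""
    parser = NestedListParser()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()

print(nested_list_to_text(
    "<ul><li>One<ul><li>Inner</li></ul></li><li>Two</li></ul>"))
```

    The same counter approach should work for nested blockquotes, at the cost of one more piece of state per tag type.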
    
  • 2020-12-31 14:31

    When I used samaspin's solution on input containing non-English Unicode characters, the parser stopped working and just returned an empty string. Creating a new parser on each iteration of the loop ensures that, even if a parser object gets corrupted, the subsequent documents do not come back as empty strings. This version also adds handling of the <br> tag to samaspin's solution. To process further tags rather than just strip them, add them, together with their expected output, to handle_starttag.

    from html.parser import HTMLParser
    from os import linesep


    class MyHTMLParser(HTMLParser):
        """
        Strips HTML tags while preserving the formatting: whitespace,
        newlines, line breaks, etc. are converted from HTML tags to their
        plain-text counterparts.
        """

        def __init__(self):
            HTMLParser.__init__(self)

        def feed(self, in_html):
            self.output = ""
            super(MyHTMLParser, self).feed(in_html)
            return self.output

        def handle_data(self, data):
            self.output += data.strip()

        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                self.output += linesep + '* '
            elif tag == 'blockquote':
                self.output += linesep + linesep + '\t'
            elif tag == 'br':
                self.output += linesep + '\n'

        def handle_endtag(self, tag):
            if tag == 'blockquote':
                self.output += linesep + linesep


    # Create a new parser for each document, as described above.
    parser = MyHTMLParser()
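
    The "new parser per loop iteration" idea could be sketched roughly as follows. `TextParser` is a cut-down stand-in for MyHTMLParser that only handles <li> and <br>, and `convert_all` is a hypothetical helper name. Both are assumptions for illustration, and "\n" is used instead of linesep to keep the example short.

```python
from html.parser import HTMLParser

class TextParser(HTMLParser):
    """Cut-down stand-in for MyHTMLParser: handles only <li> and <br>."""

    def __init__(self):
        super().__init__()
        self.output = ""

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.output += "\n* "
        elif tag == "br":
            self.output += "\n"

    def handle_data(self, data):
        self.output += data.strip()

def convert_all(documents):
    """Convert each HTML string with its own parser instance."""
    results = []
    for doc in documents:
        parser = TextParser()  # fresh instance, so no state leaks between documents
        parser.feed(doc)
        parser.close()         # flush any text still buffered by the parser
        results.append(parser.output.strip())
    return results

for text in convert_all(["<ul><li>Één</li><li>二</li></ul>",
                         "line one<br>line two"]):
    print(text)
    print("---")
```

    Creating a parser is cheap, so the extra instance per document should not matter for typical inputs.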
    
  • 2020-12-31 14:32

    Take a look at Aaron Swartz's html2text script (can be installed with pip install html2text). Note that the output is valid Markdown. If for some reason that doesn't fully suit you, some rather trivial tweaks should get you the exact output in your question:

    In [1]: import html2text
    
    In [2]: h1 = """<ul>
       ...: <li>One</li>
       ...: <li>Two</li>
       ...: </ul>"""
    
    In [3]: print(html2text.html2text(h1))
      * One
      * Two
    
    In [4]: h2 = """<p>Some text
       ...: <blockquote>
       ...: More magnificent text here
       ...: </blockquote>
       ...: Final text</p>"""
    
    In [5]: print(html2text.html2text(h2))
    Some text
    
    > More magnificent text here
    
    Final text
    