Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  • 2020-11-22 04:13

    Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and riddled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

    # Written for BeautifulSoup 3 (the Python 2 era `BeautifulSoup` module).
    import copy
    import re

    import BeautifulSoup

    def getsoup(data, to_unicode=False):
        # Turn non-breaking-space entities into plain spaces.
        data = data.replace("&nbsp;", " ")
        # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
        massage_bad_comments = [
            (re.compile(r'<!-([^-])'), lambda match: '<!--' + match.group(1)),
            (re.compile(r'<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
        ]
        myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
        myNewMassage.extend(massage_bad_comments)
        return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
            convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                        if to_unicode else None)

    # Return the text of an HTML string, or "" for empty input.
    remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""
    
  • 2020-11-22 04:14

    Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

    It also includes a trivial plain-text-to-html inverse converter.

    """
    HTML <-> text conversions.
    """
    from HTMLParser import HTMLParser, HTMLParseError
    from htmlentitydefs import name2codepoint
    import re
    
    class _HTMLToText(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self._buf = []
            self.hide_output = False
    
        def handle_starttag(self, tag, attrs):
            if tag in ('p', 'br') and not self.hide_output:
                self._buf.append('\n')
            elif tag in ('script', 'style'):
                self.hide_output = True
    
        def handle_startendtag(self, tag, attrs):
            if tag == 'br':
                self._buf.append('\n')
    
        def handle_endtag(self, tag):
            if tag == 'p':
                self._buf.append('\n')
            elif tag in ('script', 'style'):
                self.hide_output = False
    
        def handle_data(self, text):
            if text and not self.hide_output:
                self._buf.append(re.sub(r'\s+', ' ', text))
    
        def handle_entityref(self, name):
            if name in name2codepoint and not self.hide_output:
                c = unichr(name2codepoint[name])
                self._buf.append(c)
    
        def handle_charref(self, name):
            if not self.hide_output:
                n = int(name[1:], 16) if name.startswith('x') else int(name)
                self._buf.append(unichr(n))
    
        def get_text(self):
            return re.sub(r' +', ' ', ''.join(self._buf))
    
    def html_to_text(html):
        """
        Given a piece of HTML, return the plain text it contains.
        This decodes entities and char refs, and skips the contents of
        script and style elements.
        """
        parser = _HTMLToText()
        try:
            parser.feed(html)
            parser.close()
        except HTMLParseError:
            pass
        return parser.get_text()
    
    def text_to_html(text):
        """
        Convert the given text to html, wrapping what looks like URLs with <a> tags,
        converting newlines to <br> tags and converting confusing chars into html
        entities.
        """
        def f(mo):
            t = mo.group()
            if len(t) == 1:
                return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
            return '<a href="%s">%s</a>' % (t, t)
        return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
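
The code above targets Python 2 (`HTMLParser`, `htmlentitydefs`, `unichr`). For Python 3, a minimal sketch of the same HTML-to-text idea might look like the following. It is not the answerer's code: the class name `HTMLToText` and the `_hidden` flag are chosen here for illustration, and it leans on the standard library's `convert_charrefs=True` option to decode entities and character references instead of handling them by hand.

```python
import re
from html.parser import HTMLParser


class HTMLToText(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &amp; and &#39;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._hidden = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden = True
        elif tag in ("p", "br") and not self._hidden:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = False
        elif tag == "p" and not self._hidden:
            self._buf.append("\n")

    def handle_data(self, text):
        if not self._hidden:
            # Collapse each run of whitespace into a single space.
            self._buf.append(re.sub(r"\s+", " ", text))

    def get_text(self):
        return re.sub(r" +", " ", "".join(self._buf))


def html_to_text(html):
    """Return the plain text contained in an HTML string."""
    parser = HTMLToText()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return parser.get_text()
```

Calling `html_to_text(open('page.html').read())` (file name hypothetical) returns the text with a newline around each paragraph. Self-closing tags such as `<br/>` need no separate handler here because the base class routes them through `handle_starttag`.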
    
  • 2020-11-22 04:16

    If you need more speed and can accept less accuracy, you could use raw lxml.

    import lxml.html as lh
    # On lxml 5.2 and later, this import needs the separate lxml_html_clean package installed.
    from lxml.html.clean import clean_html
    
    def lxml_to_text(html):
        doc = lh.fromstring(html)
        doc = clean_html(doc)
        return doc.text_content()
    
  • Another non-Python solution is LibreOffice:

    soffice --headless --invisible --convert-to txt input1.html
    

    The reason I prefer this over the alternatives is that every HTML paragraph is converted into a single line of text (no line breaks inside it), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I wanted. Besides, LibreOffice can convert from all sorts of formats.
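
Since the question asks about Python, here is a rough sketch of driving that same `soffice` command from a script. The helper names `build_soffice_command` and `html_file_to_text` are made up for this example. It assumes `soffice` is on the PATH, that the output file keeps the input's base name with a `.txt` suffix, and that the result can be read as UTF-8.

```python
import subprocess
from pathlib import Path


def build_soffice_command(html_path, out_dir):
    """Return the argument list for a headless LibreOffice text conversion."""
    return [
        "soffice", "--headless", "--invisible",
        "--convert-to", "txt",
        "--outdir", str(out_dir),
        str(html_path),
    ]


def html_file_to_text(html_path, out_dir):
    """Convert one HTML file and return the text LibreOffice wrote."""
    subprocess.run(build_soffice_command(html_path, out_dir), check=True)
    # Assumption: the output keeps the input's base name with a .txt suffix.
    out_file = Path(out_dir) / (Path(html_path).stem + ".txt")
    # Assumption: UTF-8 output; undecodable bytes are replaced, not raised.
    return out_file.read_text(encoding="utf-8", errors="replace")
```

Building the argument list in its own function keeps the command easy to inspect or log before anything is executed.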

  • 2020-11-22 04:17

    I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

    Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.

  • 2020-11-22 04:17

    A simple way:

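    import re

    # Read the whole file; the context manager closes it afterwards.
    with open('html_file.html') as f:
        html_text = f.read()
    # Delete everything between a '<' and the next '>' on the same line.
    text_filtered = re.sub(r'<(.*?)>', '', html_text)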
    

    This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.
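
A slightly more forgiving variant of the regex idea, offered only as a sketch: the name `strip_tags` is hypothetical, and the approach still has the usual limits of regex-based stripping (the contents of script and style elements are kept, and a `>` inside an attribute value will confuse it). Compared with the pattern above, it also matches tags that span several lines, decodes entities with the standard library's `html.unescape`, and tidies whitespace.

```python
import html
import re

# [^>]* also matches newlines, so tags split across lines are removed too.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Remove tags, decode entities, and collapse whitespace."""
    # Replace each tag with a space so neighbouring words do not fuse.
    no_tags = TAG_RE.sub(" ", markup)
    # Decode entities such as &amp; and &#39; into characters.
    decoded = html.unescape(no_tags)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", decoded).strip()
```

For anything beyond quick cleanup of well-behaved markup, a real parser is the safer choice.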
