Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  • 2020-11-22 04:35

    Has anyone tried bleach.clean(html, tags=[], strip=True) from the bleach library? It works for me.

  • 2020-11-22 04:37

    The best piece of code I have found for extracting text without picking up JavaScript or other unwanted content:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    
    # get text
    text = soup.get_text()
    
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    
    print(text)
    

    You just need to install BeautifulSoup first:

    pip install beautifulsoup4
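
The whitespace cleanup at the end of the snippet above does not depend on BeautifulSoup, so it can be tried on a plain string. A minimal sketch, assuming a hypothetical helper name clean_text and a made-up sample string (neither is part of the original answer):

```python
def clean_text(text):
    """Collapse raw get_text()-style output into one phrase per line."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces, which tend to separate adjacent headlines.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only the non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Health news  \n\n   Top story  Related links \n"
print(clean_text(sample))
```

Because all three steps are generator expressions, nothing is materialised until the final join, which keeps the cleanup cheap even for long pages.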
    
  • 2020-11-22 04:37

    Another example using BeautifulSoup4 in Python 2.7.9+

    Imports:

    import urllib2
    from bs4 import BeautifulSoup
    

    Code:

    def read_website_to_text(url):
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        for script in soup(["script", "style"]):
            script.extract() 
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return str(text.encode('utf-8'))
    

    Explained:

    Read the URL's data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text with .get_text(). Break the text into lines and strip leading and trailing whitespace from each, then split lines that contain several headlines into one chunk each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then text = '\n'.join(...) drops the blank chunks, and the result is returned encoded as UTF-8.

    Notes:

    • On some systems this will fail for https:// connections because of an SSL certificate issue; you can turn off certificate verification to work around it. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/

    • Python < 2.7.9 may have issues running this.

    • text.encode('utf-8') can leave odd-looking encoded output; you may want to just return str(text) instead.
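
The first and third notes can be illustrated with the standard library alone. The following is a rough Python 3 sketch, not the linked blog's fix: build_ssl_context is a hypothetical helper name, and the sample string is made up. It shows one way to build a context that skips certificate verification, and what str() does to encoded bytes in Python 3.

```python
import ssl


def build_ssl_context(verify=True):
    """Return an SSLContext; verify=False switches certificate checks off."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname has to be disabled before verify_mode can be relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


# In Python 3, str() on encoded bytes returns the bytes repr, not the text.
text = "café"
wrapped = str(text.encode("utf-8"))
print(wrapped)  # b'caf\xc3\xa9'
```

The context can be passed as urlopen(url, context=build_ssl_context(verify=False)). Skipping verification removes protection against forged certificates, so it is better suited to debugging than to regular use. In Python 3 the function above would simply return text, since it is already a str.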

  • 2020-11-22 04:38

    You can extract just the text from HTML with BeautifulSoup:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
    con = urlopen(url).read()
    soup = BeautifulSoup(con,'html.parser')
    texts = soup.get_text()
    print(texts)
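
If installing BeautifulSoup is not an option, a rough equivalent can be sketched with the standard library's html.parser module. This is a different technique from the answer above, shown only for comparison; the class name TextExtractor, the function extract_text, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities such as &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; welcome</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(page))
```

html.parser is more forgiving than strict XML parsers but less so than BeautifulSoup with badly broken markup, so this sketch is best suited to reasonably well-formed pages.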
    
  • 2020-11-22 04:40

    @PeYoTIL's answer, which uses BeautifulSoup and eliminates style and script content, didn't work for me. I tried it with decompose instead of extract, but it still didn't work. So I wrote my own, which also formats the text using the <p> tags and replaces <a> tags with their href link. It also copes with links inside text. It is available at this gist with a test doc embedded.

    from bs4 import BeautifulSoup, NavigableString
    
    def html_to_text(html):
        "Creates a formatted text email message as a string from a rendered html template (page)"
        soup = BeautifulSoup(html, 'html.parser')
        # Ignore anything in head
        body, text = soup.body, []
        for element in body.descendants:
            # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
            if type(element) == NavigableString:
                # We use the assumption that other tags can't be inside a script or style
                if element.parent.name in ('script', 'style'):
                    continue
    
                # remove any multiple and leading/trailing whitespace
                string = ' '.join(element.string.split())
                if string:
                    if element.parent.name == 'a':
                        a_tag = element.parent
                        # replace link text with the link (keep the text if there is no href)
                        string = a_tag.get('href', string)
                        # concatenate with any non-empty immediately previous string
                        if (    type(a_tag.previous_sibling) == NavigableString and
                                a_tag.previous_sibling.string.strip() ):
                            text[-1] = text[-1] + ' ' + string
                            continue
                    elif element.previous_sibling and element.previous_sibling.name == 'a':
                        text[-1] = text[-1] + ' ' + string
                        continue
                    elif element.parent.name == 'p':
                        # Add extra paragraph formatting newline
                        string = '\n' + string
                    text += [string]
        doc = '\n'.join(text)
        return doc
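
The two ideas in this answer (replace link text with its href, and start a new paragraph at each <p>) can also be sketched without BeautifulSoup. The following uses the standard library's html.parser instead of the answer's own approach; LinkAwareExtractor, html_links_to_text, and the sample markup are hypothetical names chosen for illustration, and script/style skipping is left out for brevity.

```python
from html.parser import HTMLParser


class LinkAwareExtractor(HTMLParser):
    """Replace <a> text with its href and start a new paragraph at each <p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = [[]]  # one list of words per paragraph
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([])
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Emit the link target and suppress the link text that follows.
                self.paragraphs[-1].append(href)
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if not self._in_link:
            # split() also collapses repeated whitespace inside the text.
            self.paragraphs[-1].extend(data.split())


def html_links_to_text(html):
    parser = LinkAwareExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(" ".join(words) for words in parser.paragraphs if words)


sample = (
    '<p>See <a href="https://example.com/docs">the docs</a> for details.</p>'
    "<p>Second   paragraph.</p>"
)
print(html_links_to_text(sample))
```

An anchor without an href keeps its visible text, which mirrors the fallback a link-replacing extractor generally needs.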
    
  • 2020-11-22 04:40

    Here's the code I use on a regular basis.

    from bs4 import BeautifulSoup
    import urllib.error
    import urllib.request
    
    
    def processText(webpage):
    
        # EMPTY LIST TO STORE PROCESSED TEXT
        proc_text = []
    
        try:
            # webpage is expected to be a regex match object; .group() returns the matched URL
            news_open = urllib.request.urlopen(webpage.group())
            news_soup = BeautifulSoup(news_open, "lxml")
            news_para = news_soup.find_all("p", text = True)
    
            for item in news_para:
                # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
                para_text = (' ').join((item.text).split())
    
                # COMBINE LINES/PARAGRAPHS INTO A LIST
                proc_text.append(para_text)
    
        except urllib.error.HTTPError:
            pass
    
        return proc_text
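
processText calls webpage.group(), which suggests it receives a regular-expression match object rather than a URL string. A small sketch of that calling convention and of the split/join whitespace idiom, assuming a hypothetical URL pattern and sample strings that are not part of the original answer:

```python
import re

# Hypothetical pattern, for illustration only; real URL matching is more involved.
URL_PATTERN = re.compile(r"https?://\S+")


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())


message = "Read this:  https://example.com/news/story.html  today"
match = URL_PATTERN.search(message)
if match:
    # This is the string that processText would pass to urlopen().
    print(match.group())

print(normalize_whitespace("  First\tline\n  second   line "))
```

Passing the match object itself, as in processText(match), keeps the URL extraction and the page processing in separate steps.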
    

    I hope that helps.
