Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  • 2020-11-22 04:26

    The comment about LibreOffice Writer has merit, since the application can run Python macros. It seems to offer benefits both for answering this question and for extending LibreOffice's macro base. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

  • 2020-11-22 04:27

    PyParsing does a great job. The PyParsing wiki was taken down, so here is another location with examples of how to use PyParsing (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

    Having said that, I use BeautifulSoup a lot, and the entities issue is not that hard to deal with: you can convert the entities before you run BeautifulSoup.
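
    A minimal sketch of that "convert first" step, assuming Python 3.4 or later, where the standard library's html.unescape decodes named and numeric entities. The helper name decode_entities and the sample string are made up for illustration. Note that decoding before parsing also turns &lt; and &gt; into literal angle brackets, so with input that contains escaped markup it may be safer to decode after extracting the text instead.

```python
import html


def decode_entities(markup):
    """Hypothetical helper: replace character references such as
    &amp; or &#8211; with the characters they stand for."""
    return html.unescape(markup)


# Made-up sample input containing named and numeric entities
sample = "<p>Fish &amp; chips &copy; 2014 &#8211; caf&eacute;</p>"
decoded = decode_entities(sample)
print(decoded)
```

    The decoded string can then be handed to whichever parser you use.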

    Good luck

  • 2020-11-22 04:28

    There is the Pattern library for data mining.

    http://www.clips.ua.ac.be/pages/pattern-web

    You can even decide what tags to keep:

    from pattern.web import URL, plaintext

    s = URL('http://www.clips.ua.ac.be').download()
    s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
    print(s)
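
    If installing Pattern is not an option, the same "strip everything except a whitelist of tags" idea can be approximated with the standard library. The sketch below is not Pattern's implementation; the class KeepTagsExtractor, the function plaintext_keep and the sample markup are hypothetical names chosen for illustration, and it assumes Python 3 and reasonably well-formed HTML.

```python
from html import escape
from html.parser import HTMLParser


class KeepTagsExtractor(HTMLParser):
    """Collect text, re-emitting only whitelisted tags and attributes."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = keep        # e.g. {'a': ['href'], 'strong': []}
        self.parts = []
        self.skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1
        elif tag in self.keep and not self.skip_depth:
            # Keep only the attributes listed for this tag
            kept = ''.join(
                ' {}="{}"'.format(name, escape(value or ''))
                for name, value in attrs if name in self.keep[tag]
            )
            self.parts.append('<{}{}>'.format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.keep and not self.skip_depth:
            self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        # Drop the contents of <script> and <style> elements
        if not self.skip_depth:
            self.parts.append(data)


def plaintext_keep(markup, keep):
    """Hypothetical helper: return text with only the kept tags left in."""
    parser = KeepTagsExtractor(keep)
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)


sample = ('<div><h1>Title</h1><p>See <a href="/x" class="c">this</a> '
          '<em>now</em>.</p><script>var a;</script></div>')
result = plaintext_keep(sample, {'h1': [], 'a': ['href']})
print(result)
```

    Unlike Pattern's plaintext, this sketch does not normalize whitespace or insert line breaks for block elements.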
    
  • 2020-11-22 04:28

    Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

    lynx -dump html_to_convert.html > converted_html.txt
    

    This can be done within a Python script as follows:

    import subprocess
    
    with open('converted_html.txt', 'w') as outputFile:
        # Send Lynx's output to the file opened above
        subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)
    

    It won't give you just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
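
    The same Lynx call can be wrapped in a function that returns the text instead of writing a file. This is only a sketch: build_lynx_command and html_file_to_text are hypothetical names, it assumes Python 3.7 or later (for capture_output) and a lynx binary on the PATH, and the optional -nolist flag asks Lynx to leave out the numbered list of links it otherwise appends to the dump.

```python
import shutil
import subprocess


def build_lynx_command(path, include_links=False):
    """Hypothetical helper: build the argument list for a Lynx text dump."""
    cmd = ['lynx', '-dump']
    if not include_links:
        cmd.append('-nolist')   # omit the trailing list of link URLs
    cmd.append(path)
    return cmd


def html_file_to_text(path, include_links=False):
    """Hypothetical helper: run Lynx on an HTML file and return its text."""
    if shutil.which('lynx') is None:
        raise RuntimeError('lynx is not installed or not on PATH')
    result = subprocess.run(
        build_lynx_command(path, include_links),
        capture_output=True,    # collect stdout/stderr instead of printing
        text=True,              # decode the output to str
        check=True,             # raise CalledProcessError on failure
    )
    return result.stdout
```

    Calling html_file_to_text('html_to_convert.html') should then return roughly what the shell command above writes to converted_html.txt, minus the link list.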

  • 2020-11-22 04:29

    This isn't exactly a Python solution, but it will also render the text that JavaScript would generate, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine and will convert the source to text with the -dump option.

    So you could do something like:

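    import subprocess
    import tempfile

    # Write the HTML (html_source, a str) to a temporary file Links can read
    with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
        f.write(html_source)
        fname = f.name
    proc = subprocess.Popen(['links', '-dump', fname],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    text = proc.stdout.read()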
    
  • 2020-11-22 04:29

    I recommend a Python package called goose-extractor. Goose will try to extract the following information: the main text of an article, the main image of the article, any YouTube/Vimeo videos embedded in the article, the meta description, and the meta tags.

    More: https://pypi.python.org/pypi/goose-extractor/
