Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  慢半拍i (original poster)
    2020-11-22 04:17

    I've had good results with Apache Tika. Its purpose is to extract metadata and text from content, so the underlying parser is tuned for that job out of the box.

    Tika can be run as a server, is trivial to deploy in a Docker container, and from there can be accessed via Python bindings.
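
As a rough illustration of the server approach, here is a minimal sketch that talks to a running Tika server over its REST endpoint using only the standard library, rather than through a dedicated bindings package. It assumes a Tika server is already listening on `localhost:9998` (the server's default port); the names `TIKA_URL`, `build_request`, and `extract_text` are made up for this example.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_request(html_bytes, url=TIKA_URL):
    """Build a PUT request asking Tika for a plain-text rendering of the HTML."""
    return urllib.request.Request(
        url,
        data=html_bytes,
        method="PUT",
        headers={
            "Content-Type": "text/html",  # tell Tika what we are sending
            "Accept": "text/plain",       # ask for extracted text, not XHTML
        },
    )


def extract_text(html_bytes, url=TIKA_URL):
    """Send the document to the Tika server and return the extracted text."""
    with urllib.request.urlopen(build_request(html_bytes, url), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

With the server up, something like `extract_text(open("page.html", "rb").read())` should return the visible text of the page, where `page.html` is a placeholder file name. Keeping request construction separate from the network call makes the sketch easy to check without a live server.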
