Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers
  •  被撕碎了的回忆
    2020-11-22 04:33

    In Python 3.x you can do this with the standard-library `imaplib` and `email` packages when the HTML arrives as an email message. This is an older post, but my answer may still help newcomers who land here.

    import email

    # Assumes self.imap is an authenticated imaplib.IMAP4 connection with a
    # mailbox selected, and num is the number of the message to fetch.
    status, data = self.imap.fetch(num, '(RFC822)')
    email_msg = email.message_from_bytes(data[0][1])

    # walk() visits every part of a multipart message and yields the message
    # itself when it is not multipart. Keep the text/plain part, skip text/html.
    body = None
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the Content-Transfer-Encoding
            # (e.g. Base64, quoted-printable) and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
            break
    

    Now you can print the `body` variable and it will be plain text :) If that is good enough for you, please consider marking this as the accepted answer.
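
The snippet above depends on a live IMAP connection, so here is a minimal self-contained sketch of the same part-walking idea that runs without a server. The helper name `plain_text_body` and the locally built sample message are assumptions made for illustration; they are not part of the original answer.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def plain_text_body(raw: bytes) -> str:
    """Return the first text/plain part of a raw RFC 822 message, or ""."""
    msg = message_from_bytes(raw, policy=policy.default)
    # walk() also yields the message itself when it is not multipart.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True reverses the Content-Transfer-Encoding and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


# Hypothetical sample: a multipart/alternative message standing in for
# the bytes that an IMAP fetch would return.
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content("Hello, plain text.")
sample.add_alternative(
    "<html><body><p>Hello, <b>HTML</b>.</p></body></html>", subtype="html"
)

print(plain_text_body(sample.as_bytes()))
```

Note that this only helps when the message carries a text/plain alternative. A message that contains only a text/html part yields an empty string here, and turning that HTML into text would still need a separate HTML parser.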
