Parsing an HTML Document with Python

Question

I am totally new to Python and I am trying to parse an HTML document to remove the tags. I just want to keep the title and the body from a newspaper website that I previously downloaded to my computer.

I am using the HTMLParser class I found in the documentation, but I don't know how to use it very well, and I don't understand this language very well :(

This is my code:

# import the HTMLParser class
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    container = ""

    def handle_data(self, data):
        if (data == '\n'):
            pass
        elif (data == " "):
            pass
        else:
            self.container += data

        return self.container

parser = HTMLCleaner()

# open a file in order to parse it
archivo = open("C://Users//jotab//OneDrive//Documentos//Git//SRI//SRI_PR0//coleccionESuja2019//es_26142.html", "r", encoding="utf8")


while True:
    line = archivo.readline()
    if line == "":
        break
    else:
        parser.feed(line)

print(parser.container)

I am doing this because I am getting a lot of "\n" lines and a lot of " " lines after parsing. But when I check whether a line is blank, the comparison returns false, even though both values look exactly the same in the debugger.

I don't know why this happens, but if someone could help me parse this, it would be great.

Answer 1:

Based on the code you provided, it looks like you are trying to open an HTML file that you have saved locally.

Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire HTML file:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

# Read the whole file into one string, then feed that string to the parser
with open(r'C:\Users\...site_1.html', "r", encoding="utf8") as f:
    page = f.read()
parser.feed(page)

Python's HTML parser requires the argument to feed() to be a string. You could also paste the entire HTML directly into the feed() call. That might not be best practice, but it should read and parse the HTML:

parser.feed("THE ENTIRE HTML AS STRING HERE")
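Since the goal in the question is to keep only the title and the body text, the following is a minimal sketch of one way to do that with HTMLParser. The class name TitleBodyExtractor, its attributes, and the sample markup are hypothetical and only for illustration; skipping the contents of script and style tags is an assumption about what counts as article text.

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Hypothetical helper: collects text found inside <title> and <body>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_body = False
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only chunks and script/style contents
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title_parts.append(text)
        elif self.in_body:
            self.body_parts.append(text)


# Made-up sample page; in practice this would be the string read from the file
sample = """<html><head><title> Example headline </title>
<style>p { color: red; }</style></head>
<body>
  <p>First paragraph.</p>

  <p>Second paragraph.</p>
  <script>var x = 1;</script>
</body></html>"""

extractor = TitleBodyExtractor()
extractor.feed(sample)
extractor.close()

title = " ".join(extractor.title_parts)
body = "\n".join(extractor.body_parts)
print(title)
print(body)
```

Using data.strip() inside handle_data drops chunks that contain only whitespace, whatever mix of newlines, spaces, or tabs they hold, so no exact comparison against "\n" or " " is needed.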

I hope this helps

Edit: Have you tried reading the HTML into a string, as you already do, and then calling str.strip() on it to remove all leading and trailing whitespace?

FYI, you can also use sentence.replace(" ", "") to remove every space from a string. Note that this also removes the spaces between words.
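To make the difference between these calls concrete, here is a small sketch on a made-up string (the variable names and the sample text are only illustrative). It also shows a blank check that does not depend on the chunk being exactly "\n" or " ", which may be why the comparisons in the question fail if the chunks contain a mix of whitespace characters.

```python
raw = "\n   Breaking news:  markets rally \n\n"

trimmed = raw.strip()               # removes leading and trailing whitespace only
no_spaces = raw.replace(" ", "")    # removes every space, keeps the newlines
collapsed = " ".join(raw.split())   # collapses each whitespace run to one space

# A chunk such as "\n   " is not equal to "\n" or " ", but it is still blank
chunk = "\n   "
is_blank = chunk.strip() == ""

print(repr(trimmed))
print(repr(no_spaces))
print(repr(collapsed))
print(is_blank)
```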

Hope this helps

Source: https://stackoverflow.com/questions/54581678/parsing-an-html-document-with-python

Tags

python

html

parsing

html-parsing