Question
I am totally new to Python and I am trying to parse an HTML document to remove the tags. I just want to keep the title and the body of a newspaper page that I previously downloaded to my computer.
I am using the HTMLParser class I found in the documentation, but I don't know how to use it very well, and I don't understand the language very well yet :(
This is my code:
# import the HTMLParser class
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    container = ""

    def handle_data(self, data):
        if (data == '\n'):
            pass
        elif (data == " "):
            pass
        else:
            self.container += data
        return self.container

parser = HTMLCleaner()
# open a file in order to parse it
archivo = open("C://Users//jotab//OneDrive//Documentos//Git//SRI//SRI_PR0//coleccionESuja2019//es_26142.html", "r", encoding="utf8")
while True:
    line = archivo.readline()
    if line == "":
        break
    else:
        parser.feed(line)
print(parser.container)
I am doing this because I get a lot of "\n" lines and a lot of " " lines after parsing. But when I check whether a line is blank, the comparison returns False even though both values look exactly the same in the debugger.
I don't know why this happens, but if someone could help me parse this, that would be great.
Answer 1:
Based on the code you provided, it looks like you are trying to open an HTML file that you have saved locally.
Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire HTML file.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()

# Read the whole file into a single string, because feed() expects a str
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()

parser.feed(page)
Python's HTML parser requires the argument to feed() to be a string. You could also copy and paste the entire HTML directly into the feed() call. That might not be best practice, but it should read and parse the HTML:
parser.feed("THE ENTIRE HTML AS STRING HERE")
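To connect this to the goal of keeping only the title and the body, here is a minimal sketch rather than a definitive solution. The class name TitleBodyExtractor, the sample HTML, and the decision to skip script and style contents are all assumptions made for illustration; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Hypothetical helper: collects <title> text and visible <body> text."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # open tags we track: title, body, script, style

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body") or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()  # drops chunks that are only spaces or newlines
        if not text or not self._stack:
            return
        current = self._stack[-1]
        if current == "title":
            self.title_parts.append(text)
        elif current == "body":
            self.body_parts.append(text)
        # text inside script/style is ignored on purpose


# Made-up sample document standing in for the downloaded page
sample = """<html><head><title> Example headline </title>
<style>p { color: red; }</style></head>
<body>
  <p>First paragraph.</p>
  <script>var x = 1;</script>
  <p>Second paragraph.</p>
</body></html>"""

extractor = TitleBodyExtractor()
extractor.feed(sample)  # the whole document as one string
extractor.close()

title = " ".join(extractor.title_parts)
body = "\n".join(extractor.body_parts)
print(title)
print(body)
```

For a real file, the string passed to feed() would be the result of f.read() instead of the sample above.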
I hope this helps
Edit:
Have you tried getting the HTML into a string, as you already do, and then calling str.strip() on the string to remove all whitespace from its beginning and end?
FYI, you can also use sentence.replace(" ", "") to remove every space from the string, including the spaces between words.
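A likely reason the comparisons in the question fail is that the data between two tags is often a newline followed by indentation, so it is neither exactly "\n" nor exactly " ". The short sketch below illustrates this; the variables chunk and sentence are made-up examples, not data from the original file.

```python
# Typical whitespace-only data between two tags: a newline plus indentation
chunk = "\n    "

is_single_space = chunk == " "       # False: it is not exactly one space
is_single_newline = chunk == "\n"    # False: it is not exactly one newline
is_blank = chunk.strip() == ""       # True: nothing is left after stripping

sentence = "  Hello   world \n"
stripped = sentence.strip()              # removes leading/trailing whitespace only
no_spaces = sentence.replace(" ", "")    # removes every space, even between words
collapsed = " ".join(sentence.split())   # collapses whitespace runs to one space

print(is_single_space, is_single_newline, is_blank)
print(repr(stripped), repr(no_spaces), repr(collapsed))
```

Inside handle_data, a check such as `if data.strip():` would therefore be a more robust way to skip blank chunks than comparing against fixed strings.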
Hope this helps
Source: https://stackoverflow.com/questions/54581678/parsing-an-html-document-with-python