Easy way to get data between tags of XML or HTML files in Python?

时光取名叫无心 2021-02-06 17:05

I am using Python and need to find and retrieve all character data between tags:

<tag>I need this stuff</tag>

I then want to output

6 Answers
  •  佛祖请我去吃肉
    2021-02-06 17:37

    Use XPath with lxml:

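    from lxml import etree

    # etree.HTML() expects the markup itself (str or bytes), not a file object.
    # Reading bytes lets lxml detect the document encoding.
    with open("pageToParse.html", "rb") as pageInMemory:
        parsedPage = etree.HTML(pageInMemory.read())

    # Every text node anywhere inside each <tag> element, in document order.
    yourListOfText = parsedPage.xpath("//tag//text()")

    # writelines() adds no separators, so the text nodes are written back to back.
    with open("savedFile", "w") as saveFile:
        saveFile.writelines(yourListOfText)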
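
    To see what the `//tag//text()` expression returns without touching any files, here is a minimal sketch on an in-memory string. The snippet in `sample_html` is made up for illustration, and `<tag>` is only a placeholder element name, as in the answer. It also shows one possible way to get one string per element instead of a flat list of text nodes.

```python
from lxml import etree

# Hypothetical input, invented for this example; <tag> is a placeholder name.
sample_html = (
    "<html><body>"
    "<tag>I need this stuff</tag>"
    "<p>skip me</p>"
    "<tag>and <b>this</b> too</tag>"
    "</body></html>"
)

root = etree.HTML(sample_html)

# Flat list: one entry per text node found anywhere inside a <tag> element.
texts = root.xpath("//tag//text()")
print(texts)

# One string per <tag> element: join the text nodes inside each element.
per_element = ["".join(element.itertext()) for element in root.iter("tag")]
print(per_element)
```

    The flat list splits the second element into several pieces because the nested `<b>` creates separate text nodes. Joining per element may be more convenient if each tag should end up on its own line in the output file.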
    

    lxml is generally faster than Beautiful Soup.

    If you want to test your XPath expressions, I find the Firefox extension XPather extremely helpful.
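
    XPath expressions can also be tried directly from Python. The following is only a sketch: the helper name `try_xpath` and the `SAMPLE` markup are hypothetical and not part of lxml. It evaluates a few candidate expressions against a small snippet so their results can be compared.

```python
from lxml import etree

# Hypothetical sample markup; <tag> is the placeholder element name used above.
SAMPLE = (
    "<html><body>"
    "<tag>I need this stuff</tag>"
    "<tag>and <b>this</b> too</tag>"
    "</body></html>"
)


def try_xpath(expression, markup=SAMPLE):
    """Evaluate an XPath expression against the markup and return the raw result."""
    return etree.HTML(markup).xpath(expression)


# Compare a few candidate expressions side by side.
for candidate in ("//tag//text()", "//tag/text()", "count(//tag)"):
    print(candidate, "->", try_xpath(candidate))
```

    Here `//tag/text()` returns only the text directly inside each `<tag>`, while `//tag//text()` also descends into child elements such as `<b>`.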

    Further Notes:

    • lxml-an-underappreciated-web-scraping-library
    • web-scraping-with-lxml
