Easy way to get data between tags of XML or HTML files in Python?

时光取名叫无心 2021-02-06 17:05

I am using Python and need to find and retrieve all character data between tags:

<tag>I need this stuff</tag>

I then want to output

6 Answers
  • 2021-02-06 17:16

    I quite like parsing the document into an ElementTree and then reading element.text and element.tail.

    It also supports XPath-like searching:

    >>> from xml.etree.ElementTree import ElementTree
    >>> tree = ElementTree()
    >>> tree.parse("index.xhtml")
    <Element html at b7d3f1ec>
    >>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
    >>> p
    <Element p at 8416e0c>
    >>> p.text
    "Some text in the Paragraph"
    >>> links = list(p.iter("a"))   # Returns list of all links (getiterator() was removed in Python 3.9)
    >>> links
    [<Element a at b7d4f9ec>, <Element a at b7d4fb0c>]
    >>> for i in links:             # Iterates through all found links
    ...     i.attrib["target"] = "blank"
    >>> tree.write("output.xhtml")
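Applied to the question's own markup, the same module can collect the text of every matching element. This is a minimal sketch, assuming the input is well-formed XML; the sample string and the helper name `texts_between` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element name "tag" mirrors the question.
SAMPLE = "<root><tag>I need this stuff</tag> blah <tag>and <b>this</b> too</tag></root>"

def texts_between(xml_source, tag_name):
    """Return the full text content of every element named tag_name."""
    root = ET.fromstring(xml_source)
    # itertext() yields the element's own text plus the text and tail
    # of each descendant, in document order.
    return ["".join(el.itertext()) for el in root.iter(tag_name)]

print(texts_between(SAMPLE, "tag"))
```

Joining `itertext()` rather than reading `.text` alone keeps text that follows a nested child element, which `.text` would miss.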
    
  • 2021-02-06 17:21

    Beautiful Soup is a wonderful HTML/XML parser for Python:

    Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

    1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
    2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
    3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
  • 2021-02-06 17:22

    Without external modules, e.g.:

    >>> myhtml = """ <tag>I need this stuff</tag>
    ... blah blah
    ... <tag>I need this stuff too
    ... </tag>
    ... blah blah """
    >>> for item in myhtml.split("</tag>"):
    ...   if "<tag>" in item:
    ...       print(item[item.find("<tag>") + len("<tag>"):].strip())
    ...
    I need this stuff
    I need this stuff too
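If the markup carries attributes or nested elements that plain string splitting cannot handle, the standard library's `html.parser` is another option that needs no external module. The following is only a sketch: the class name `TagTextCollector` and the helper `extract` are hypothetical, and it assumes each matching element should produce one stripped string.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the character data found inside every <tag_name> element."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0                    # > 0 while inside a matching element
        self.results = []                 # one string per outermost matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)

def extract(html, tag_name="tag"):
    """Return the text of every <tag_name> element found in html."""
    parser = TagTextCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.results

myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """
print(extract(myhtml))
```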
    
  • 2021-02-06 17:29

    This is how I am doing it (note that it returns only the text of the first <tag> element):

        (myhtml.split('<tag>')[1]).split('</tag>')[0]
    

    Tell me if it worked!
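The one-liner raises IndexError when the opening tag is absent and ignores any later occurrences. A possible variant that returns every match is sketched below. It swaps in a regular expression, which is not the method used in this answer, and assumes the tags carry no attributes and are not nested. The function name `all_between` is hypothetical.

```python
import re

def all_between(text, tag="tag"):
    """Return the raw text between every <tag>...</tag> pair.

    Assumes plain tags without attributes and without nesting.
    """
    name = re.escape(tag)
    # Non-greedy match; DOTALL lets the content span several lines.
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(name), re.DOTALL)
    return pattern.findall(text)

print(all_between("<tag>first</tag> blah <tag>second</tag>"))
```

For anything beyond simple, predictable markup, a real parser is the safer choice.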

  • 2021-02-06 17:30
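    def value_tag(s):
        """Return the text between the first '>' and the next '<' in s."""
        i = s.index('>')   # end of the opening tag
        s = s[i+1:]
        i = s.index('<')   # start of the closing tag
        s = s[:i]
        return s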
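Because `str.index` raises ValueError when a bracket is missing, callers may prefer a version that signals failure with None. A minimal sketch of that idea, using `str.partition`; the name `value_tag_safe` is hypothetical and not part of the answer above.

```python
def value_tag_safe(s):
    """Like value_tag, but return None when '>' or the following '<' is missing."""
    _, found_open, rest = s.partition('>')
    if not found_open:
        return None
    value, found_close, _ = rest.partition('<')
    return value if found_close else None

print(value_tag_safe('<tag>I need this stuff</tag>'))
```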
    
  • 2021-02-06 17:37

    Use XPath and lxml:

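    from lxml import etree

    # etree.HTML() expects the markup as a string, so read the file first
    with open("pageToParse.html", "r") as pageInMemory:
        parsedPage = etree.HTML(pageInMemory.read())

    # Collect every text node found inside any <tag> element
    yourListOfText = parsedPage.xpath("//tag//text()")

    # The files are closed automatically when each with block ends
    with open("savedFile", "w") as saveFile:
        saveFile.writelines(yourListOfText)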
    

    lxml is generally faster than Beautiful Soup.

    If you want to test your XPath expressions, I find Firefox's XPather add-on extremely helpful.

    Further Notes:

    • lxml-an-underappreciated-web-scraping-library
    • web-scraping-with-lxml