Going through HTML DOM in Python

南旧 2021-01-03 02:10

I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and walks the DOM to find a specific element.

I currently have

2 Answers
  • 2021-01-03 02:12

    There are many different modules you could use. For example, lxml or BeautifulSoup.

    Here's an lxml example:

    import urllib.request
    import lxml.html
    
    mysite = urllib.request.urlopen('http://www.google.com').read()
    lxml_mysite = lxml.html.fromstring(mysite)
    
    description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
    text = description.get('content') # content attribute of the tag
    
    >>> print(text)
    "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
    

    And a BeautifulSoup example:

    import urllib.request
    from bs4 import BeautifulSoup
    
    mysite = urllib.request.urlopen('http://www.google.com').read()
    soup_mysite = BeautifulSoup(mysite, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
    
    description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
    text = description['content'] # text of content attribute
    
    >>> print(text)
    u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
    

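    Note that the u prefix in the BeautifulSoup output comes from Python 2. There, BeautifulSoup always returns a unicode string, while lxml returns a plain byte string for ASCII-only values. In Python 3 both libraries return str, so the difference only matters if the script must also run on Python 2.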
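
    If installing a third-party module is not an option, the same meta-description lookup can be sketched with the standard library's html.parser. This is only a minimal illustration, not a replacement for lxml or BeautifulSoup: the class name MetaDescriptionFinder, the function find_description, and the sample HTML string are all hypothetical, and the example parses an inline string instead of fetching a URL.

```python
from html.parser import HTMLParser


class MetaDescriptionFinder(HTMLParser):
    """Hypothetical helper: remembers the content of the first <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        attributes = dict(attrs)
        if (self.description is None
                and tag == "meta"
                and attributes.get("name") == "description"):
            self.description = attributes.get("content")


def find_description(html_text):
    """Return the description text, or None if the page has no such meta tag."""
    finder = MetaDescriptionFinder()
    finder.feed(html_text)
    finder.close()
    return finder.description


# Made-up sample document, used here so the example runs without network access.
sample = (
    '<html><head>'
    '<meta name="description" content="A sample page.">'
    '</head><body></body></html>'
)
print(find_description(sample))
```

    To use this on a fetched page, the bytes returned by urllib.request.urlopen(...).read() would first need to be decoded to str; which encoding to use depends on the page and is an assumption you would have to check.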

  • 2021-01-03 02:25

    Check out the BeautifulSoup module.

    from bs4 import BeautifulSoup
    import urllib.request
    
    # In Python 3, urlopen lives in urllib.request rather than urllib
    soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")
    
    for link in soup.find_all('a'):
        print(link.get('href'))
    