How can I get href links from HTML using Python?

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()

print html

So far so good.

But I want to extract only the href links from the HTML. How can I do that?

10 Answers
  • You can use the HTMLParser module.

    The code would probably look something like this:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Only parse the 'anchor' tag.
            if tag == "a":
                # Check the list of defined attributes.
                for name, value in attrs:
                    # If href is defined, print it.
                    if name == "href":
                        print name, "=", value


    parser = MyHTMLParser()
    parser.feed(your_html_string)
    

    Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
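
    For Python 3, a minimal sketch of the same parser (per the note above, only the import and the print call change):

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Only parse the 'anchor' tag.
            if tag == "a":
                # If href is defined, print it.
                for name, value in attrs:
                    if name == "href":
                        print(name, "=", value)

    parser = MyHTMLParser()
    parser.feed(your_html_string)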

  • 2020-11-27 04:02

    Compared to the real gurus out there my answer probably sucks, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I tested it on Google and the output seems right. Hope it helps!

    import urllib

    test = urllib.urlopen("http://www.google.com").read()
    sane = 0
    needlestack = []
    while sane == 0:
        curpos = test.find("href")
        if curpos >= 0:
            testlen = len(test)
            test = test[curpos:testlen]
            curpos = test.find('"')
            testlen = len(test)
            test = test[curpos+1:testlen]
            curpos = test.find('"')
            needle = test[0:curpos]
            # startswith accepts a tuple; writing ("http" or "www")
            # would only ever test for "http".
            if needle.startswith(("http", "www")):
                needlestack.append(needle)
        else:
            sane = 1
    for item in needlestack:
        print item
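
    For comparison, a rough sketch of the same scan-for-href idea using the standard-library re module (Python 3 here; still not a real HTML parser, so single-quoted or unquoted href attributes are missed):

    import re
    import urllib.request

    html = urllib.request.urlopen("http://www.google.com").read().decode("utf-8", "replace")
    # Capture whatever sits between href=" and the next double quote,
    # then keep only absolute-looking links, as the loop above does.
    links = [link for link in re.findall(r'href="([^"]*)"', html)
             if link.startswith(("http", "www"))]
    for link in links:
        print(link)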
    
  • 2020-11-27 04:06

    This is way late to answer, but it will work for users of recent Python versions:

    from bs4 import BeautifulSoup
    import requests 
    
    
    html_page = requests.get('http://www.example.com').text
    
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll('a'):
        print(link.get('href'))
    

    Don't forget to install the "requests", "beautifulsoup4" and "lxml" packages. Use .text on the result of get; passing the Response object itself to BeautifulSoup raises an exception.

    "lxml" is specified explicitly to silence the warning BeautifulSoup emits when no parser is named. You can also use "html.parser", whichever fits your case.

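    If some anchor tags carry no href at all, link.get('href') above prints None for them; a small variation on the same BeautifulSoup approach filters those out with href=True:

    from bs4 import BeautifulSoup
    import requests

    html_page = requests.get('http://www.example.com').text
    soup = BeautifulSoup(html_page, "lxml")

    # href=True matches only <a> tags that actually have an href attribute,
    # so link['href'] is always safe to index.
    for link in soup.find_all('a', href=True):
        print(link['href'])
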
  • 2020-11-27 04:09

    Here's a lazy version of @stephen's answer

    import html.parser
    import itertools
    import urllib.request
    
    class LinkParser(html.parser.HTMLParser):
        def reset(self):
            # reset() is called by HTMLParser.__init__, so this also
            # initializes self.links on construction.
            super().reset()
            self.links = iter([])

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for (name, value) in attrs:
                    if name == 'href':
                        # Chain each href lazily instead of building a list.
                        self.links = itertools.chain(self.links, [value])


    def gen_links(stream, parser):
        encoding = stream.headers.get_content_charset() or 'UTF-8'
        for line in stream:
            parser.feed(line.decode(encoding))
            # Yield whatever links the parser has accumulated so far.
            yield from parser.links
    

    Use it like so:

    >>> parser = LinkParser()
    >>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
    >>> links = gen_links(stream, parser)
    >>> next(links)
    '//stackoverflow.com'
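
    Since gen_links is a generator, it only parses as much of the page as you actually consume; for instance, a small usage sketch pulling just the next few links with itertools.islice:

    >>> import itertools
    >>> for link in itertools.islice(links, 3):
    ...     print(link)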
    