How can I get href links from HTML using Python?

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()

print html

So far so good.

But I want to extract only the href links from the HTML. How can I do that?

10 Answers
  • You can use the HTMLParser module.

    The code would probably look something like this:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Only parse the 'anchor' tag.
            if tag == "a":
                # Check the list of defined attributes.
                for name, value in attrs:
                    # If href is defined, print it.
                    if name == "href":
                        print name, "=", value


    parser = MyHTMLParser()
    parser.feed(your_html_string)
    

    Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
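
    For Python 3, a minimal sketch of the same parser (per the note above, only the import and the print call change):

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Only parse the 'anchor' tag.
            if tag == "a":
                # If href is defined, print it.
                for name, value in attrs:
                    if name == "href":
                        print(name, "=", value)

    parser = MyHTMLParser()
    parser.feed(your_html_string)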

  • 2020-11-27 04:02

    Compared to the real gurus out there my answer probably sucks, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I tested it on Google and the output seems right. Hope it helps!

    import urllib

    test = urllib.urlopen("http://www.google.com").read()
    sane = 0
    needlestack = []
    while sane == 0:
        curpos = test.find("href")
        if curpos >= 0:
            testlen = len(test)
            test = test[curpos:testlen]
            curpos = test.find('"')
            testlen = len(test)
            test = test[curpos+1:testlen]
            curpos = test.find('"')
            needle = test[0:curpos]
            # startswith accepts a tuple; writing ("http" or "www")
            # would only ever test for "http".
            if needle.startswith(("http", "www")):
                needlestack.append(needle)
        else:
            sane = 1
    for item in needlestack:
        print item
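
    For comparison, a rough sketch of the same scan-for-href idea using the standard-library re module (Python 3 here; still not a real HTML parser, so single-quoted or unquoted href attributes are missed):

    import re
    import urllib.request

    html = urllib.request.urlopen("http://www.google.com").read().decode("utf-8", "replace")
    # Capture whatever sits between href=" and the next double quote,
    # then keep only absolute-looking links, as the loop above does.
    links = [link for link in re.findall(r'href="([^"]*)"', html)
             if link.startswith(("http", "www"))]
    for link in links:
        print(link)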
    
  • 2020-11-27 04:06

    This is way late to answer, but it will work for users of recent Python versions:

    from bs4 import BeautifulSoup
    import requests 
    
    
    html_page = requests.get('http://www.example.com').text
    
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll('a'):
        print(link.get('href'))
    

    Don't forget to install the "requests", "beautifulsoup4" and "lxml" packages. Use .text on the result of get; passing the Response object itself to BeautifulSoup raises an exception.

    "lxml" is specified explicitly to silence the warning BeautifulSoup emits when no parser is named. You can also use "html.parser", whichever fits your case.

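    If some anchor tags carry no href at all, link.get('href') above prints None for them; a small variation on the same BeautifulSoup approach filters those out with href=True:

    from bs4 import BeautifulSoup
    import requests

    html_page = requests.get('http://www.example.com').text
    soup = BeautifulSoup(html_page, "lxml")

    # href=True matches only <a> tags that actually have an href attribute,
    # so link['href'] is always safe to index.
    for link in soup.find_all('a', href=True):
        print(link['href'])
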
  • 2020-11-27 04:09

    Here's a lazy version of @stephen's answer

    import html.parser
    import itertools
    import urllib.request
    
    class LinkParser(html.parser.HTMLParser):
        def reset(self):
            # reset() is called by HTMLParser.__init__, so this also
            # initializes self.links on construction.
            super().reset()
            self.links = iter([])

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for (name, value) in attrs:
                    if name == 'href':
                        # Chain each href lazily instead of building a list.
                        self.links = itertools.chain(self.links, [value])


    def gen_links(stream, parser):
        encoding = stream.headers.get_content_charset() or 'UTF-8'
        for line in stream:
            parser.feed(line.decode(encoding))
            # Yield whatever links the parser has accumulated so far.
            yield from parser.links
    

    Use it like so:

    >>> parser = LinkParser()
    >>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
    >>> links = gen_links(stream, parser)
    >>> next(links)
    '//stackoverflow.com'
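
    Since gen_links is a generator, it only parses as much of the page as you actually consume; for instance, a small usage sketch pulling just the next few links with itertools.islice:

    >>> import itertools
    >>> for link in itertools.islice(links, 3):
    ...     print(link)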
    