import urllib2
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want to extract just the hyperlinks (href values) from that HTML, rather than printing the whole page.
Simplest way for me:
from urlextract import URLExtract
import requests
url = "http://sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
# crude cleanup: strip spaces and '=' so URLs inside attributes like href='...' are easier to spot
text = text.replace(' ', '').replace('=', '')
extractor = URLExtract()
print(extractor.find_urls(text))
output:
['http://google.com/', 'http://yahoo.com/']
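URLExtract is a third-party package; if you don't have it yet, it is available from PyPI (pip install urlextract). To list each link only once, you can deduplicate the result with plain Python, for example sorted(set(extractor.find_urls(text))).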
This answer is similar to others using requests and BeautifulSoup, but written with list comprehensions. Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.find_all("a") and combine it with a list comprehension:
import requests
from bs4 import BeautifulSoup
URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links (skip anchors without an href, whose get() would return None)
ext_links = [link.get("href") for link in soup("a") if link.get("href") and "http" in link.get("href")]
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
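If you also need relative links (such as /about) as absolute URLs, one possible extension, reusing the URL and soup names from the snippet above, is to resolve each href with the standard-library urllib.parse.urljoin:
from urllib.parse import urljoin

# Resolve every href (relative or absolute) against the page URL;
# href=True skips <a> tags that have no href attribute at all.
abs_links = [urljoin(URL, link["href"]) for link in soup("a", href=True)]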
Using requests with BeautifulSoup and Python 3:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.find_all('a'):
    print(link.get('href'))
Using BS4 for this specific task seems overkill.
Try instead:
import re
import urllib2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)
I found this nifty piece of code at http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.
I tested it only on my scenario of extracting a list of files from a web folder that exposes the files/folders in it, and I got a sorted list of the files/folders under the URL.
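If you would rather avoid both regular expressions and an extra dependency, a minimal Python 3 sketch using the standard-library html.parser can collect the same hrefs; here html is assumed to already hold the page source as a string, as in the snippet above:
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects the href value of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed(html)  # html must be str; decode bytes first if needed
print(collector.links)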
Look at using the Beautiful Soup HTML parsing library.
http://www.crummy.com/software/BeautifulSoup/
You will do something like this:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
Try it with Beautiful Soup:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http://, you should use:
soup.findAll('a', attrs={'href': re.compile("^http://")})
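Beautiful Soup also accepts attribute filters as keyword arguments, so the same filter can be written a bit more compactly:
soup.findAll('a', href=re.compile("^http://"))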
In Python 3 with BS4 it should be:
from bs4 import BeautifulSoup
import urllib.request
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
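Note that link.get('href') returns None for <a> tags that have no href attribute, so the loop above can print None. If you only want real links, you can let Beautiful Soup filter them with href=True:
for link in soup.find_all('a', href=True):
    print(link['href'])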