import urllib2
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want to extract just the hyperlinks (href values) from that HTML, rather than printing the whole page.
Simplest way for me:
from urlextract import URLExtract
import requests
url = "http://sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
# crude cleanup: strip spaces and '=' so URLs inside attributes like href='...' are easier to spot
text = text.replace(' ', '').replace('=', '')
extractor = URLExtract()
print(extractor.find_urls(text))
output:
['http://google.com/', 'http://yahoo.com/']
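URLExtract is a third-party package; if you don't have it yet, it is available from PyPI (pip install urlextract). To list each link only once, you can deduplicate the result with plain Python, for example sorted(set(extractor.find_urls(text))).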
This answer is similar to others using requests and BeautifulSoup, but written with list comprehensions. Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.find_all("a") and combine it with a list comprehension:
import requests
from bs4 import BeautifulSoup
URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links (skip anchors without an href, whose get() would return None)
ext_links = [link.get("href") for link in soup("a") if link.get("href") and "http" in link.get("href")]
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
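If you also need relative links (such as /about) as absolute URLs, one possible extension, reusing the URL and soup names from the snippet above, is to resolve each href with the standard-library urllib.parse.urljoin:
from urllib.parse import urljoin

# Resolve every href (relative or absolute) against the page URL;
# href=True skips <a> tags that have no href attribute at all.
abs_links = [urljoin(URL, link["href"]) for link in soup("a", href=True)]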
Using requests with BeautifulSoup and Python 3:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.find_all('a'):
    print(link.get('href'))
Using BS4 for this specific task seems overkill.
Try instead:
import re
import urllib2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)
I found this nifty piece of code at http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.
I tested it only on my scenario of extracting a list of files from a web folder that exposes the files/folders in it, and I got a sorted list of the files/folders under the URL.
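If you would rather avoid both regular expressions and an extra dependency, a minimal Python 3 sketch using the standard-library html.parser can collect the same hrefs; here html is assumed to already hold the page source as a string, as in the snippet above:
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects the href value of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed(html)  # html must be str; decode bytes first if needed
print(collector.links)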
Look at using the Beautiful Soup HTML parsing library.
http://www.crummy.com/software/BeautifulSoup/
You will do something like this:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
Try it with Beautiful Soup:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http://, you should use:
soup.findAll('a', attrs={'href': re.compile("^http://")})
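Beautiful Soup also accepts attribute filters as keyword arguments, so the same filter can be written a bit more compactly:
soup.findAll('a', href=re.compile("^http://"))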
In Python 3 with BS4 it should be:
from bs4 import BeautifulSoup
import urllib.request
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
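Note that link.get('href') returns None for <a> tags that have no href attribute, so the loop above can print None. If you only want real links, you can let Beautiful Soup filter them with href=True:
for link in soup.find_all('a', href=True):
    print(link['href'])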