How can I get href links from HTML using Python?

自闭症患者 2020-11-27 03:25
import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()  # read from the handle opened above

print html

So far so good.

But I want to extract just the href links from the HTML. How can I do that?

10 Answers
  • 2020-11-27 03:46

    The simplest way for me, using the third-party urlextract package:

    from urlextract import URLExtract
    import requests

    url = "http://sample.com/samplepage/"
    req = requests.get(url)
    text = req.text
    # or if you already have the html source:
    # text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
    # strip spaces and '=' so the extractor sees bare URLs inside the markup
    text = text.replace(' ', '').replace('=', '')
    extractor = URLExtract()
    print(extractor.find_urls(text))
    
    

    output:

    ['http://google.com/', 'http://yahoo.com/']

  • 2020-11-27 03:48

    This answer is similar to the others using requests and BeautifulSoup, but written with list comprehensions.

    Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.find_all("a") and combine it with a list comprehension:

    import requests
    from bs4 import BeautifulSoup
    
    URL = "http://www.yourwebsite.com"
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features='lxml')
    # Find all links
    all_links = [link.get("href") for link in soup("a")]
    # Only external links (guard against anchors with no href)
    ext_links = [link.get("href") for link in soup("a") if link.get("href") and "http" in link.get("href")]
    

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
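
    If you also want to deduplicate the results and skip anchors with no href at all, a small follow-up sketch (my addition, reusing the soup object from above):

    # a set comprehension drops duplicates; the guard drops None hrefs
    unique_links = {link.get("href") for link in soup("a") if link.get("href")}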

  • 2020-11-27 03:49

    Using requests with BeautifulSoup and Python 3:

    import requests 
    from bs4 import BeautifulSoup
    
    
    page = requests.get('http://www.website.com')
    bs = BeautifulSoup(page.content, features='lxml')
    for link in bs.findAll('a'):
        print(link.get('href'))
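
    A variant sketch, not from the original answer, that uses bs4's CSS-selector support to visit only anchors that actually have an href (reusing the bs object from above):

    # 'a[href]' matches only <a> tags that carry an href attribute
    for link in bs.select('a[href]'):
        print(link['href'])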
    
  • 2020-11-27 03:51

    Using BS4 for this specific task seems overkill.

    Try instead:

    import re
    import urllib2

    website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
    html = website.read()
    # capture href values ending in .tgz or .tar.gz
    files = re.findall(r'href="(.*\.tgz|.*\.tar\.gz)"', html)
    print sorted(files)
    

    I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

    I have only tested it in my scenario of extracting a list of files from a web folder that exposes the files and folders in it, and I got a sorted list of the files/folders under the URL.
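
    Since urllib2 only exists in Python 2, here is a minimal Python 3 sketch of the same approach (the URL is the same placeholder as above):

    import re
    import urllib.request

    # urllib.request replaces urllib2 in Python 3
    website = urllib.request.urlopen('http://10.123.123.5/foo_images/Repo/')
    html = website.read().decode('utf-8')  # bytes -> str before regex matching
    files = re.findall(r'href="(.*\.tgz|.*\.tar\.gz)"', html)
    print(sorted(files))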

  • 2020-11-27 03:53

    Look at using the Beautiful Soup HTML parsing library.

    http://www.crummy.com/software/BeautifulSoup/

    With the old BeautifulSoup 3 (under Python 2), you will do something like this:

    import BeautifulSoup                      # the BeautifulSoup 3 module (Python 2)
    soup = BeautifulSoup.BeautifulSoup(html)  # html is the page source as a string
    for link in soup.findAll("a"):
        print link.get("href")
    
  • 2020-11-27 03:56

    Try with BeautifulSoup:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re
    
    html_page = urllib2.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page)
    for link in soup.findAll('a'):
        print link.get('href')
    

    In case you just want links starting with http://, you should use:

    soup.findAll('a', attrs={'href': re.compile("^http://")})
    

    In Python 3 with BS4 it should be:

    from bs4 import BeautifulSoup
    import urllib.request
    
    html_page = urllib.request.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page, "html.parser")
    for link in soup.findAll('a'):
        print(link.get('href'))
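
    Note that href values are often relative. A follow-up sketch (my addition) using urllib.parse.urljoin to resolve them against the page URL:

    from urllib.parse import urljoin

    base = "http://www.yourwebsite.com"
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            # urljoin resolves relative paths like "/about" against the base
            print(urljoin(base, href))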
    