How to extract and download all images from a website using BeautifulSoup?

南笙 2020-11-27 18:43

I am trying to extract and download all images from a URL. I wrote a script:

import urllib2
import re
from os.path import basename
from urlparse import urlsp         


        
2 Answers
  • 2020-11-27 19:08

    The following should extract all images from a given page and write them to the directory where the script is run.

    import re
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

    site = 'http://pixabay.com'

    response = requests.get(site)

    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')

    # skip <img> tags that have no src attribute
    urls = [img['src'] for img in img_tags if img.get('src')]

    for url in urls:
        # take the file name from the end of the url
        filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
        if not filename:
            print("Regex didn't match with the url: {}".format(url))
            continue
        # an image source can be relative; resolve it against the page url,
        # which also happens to be the site variable atm.
        url = urljoin(site, url)
        # download first, so a failed request does not leave an empty file behind
        response = requests.get(url)
        with open(filename.group(1), 'wb') as f:
            f.write(response.content)
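
    Image sources are often relative, protocol-relative, or carry a query string, so it can help to resolve the URL and derive the file name in one place before downloading. Below is a minimal sketch of that step using only the standard library (Python 3). The helper name resolve_image and the ALLOWED_EXTENSIONS set are made up for illustration; the extensions simply mirror the jpg/gif/png pattern used in this answer.

```python
import os
from urllib.parse import urljoin, urlparse

# Assumption: only these extensions are wanted, mirroring the jpg/gif/png regex.
ALLOWED_EXTENSIONS = {'.jpg', '.gif', '.png'}


def resolve_image(page_url, src):
    """Return (absolute_url, filename), or None if src is not a jpg/gif/png.

    Hypothetical helper, not part of requests or BeautifulSoup.
    """
    # urljoin handles absolute, root-relative and protocol-relative sources
    absolute_url = urljoin(page_url, src)
    # .path excludes the query string, so 'a.jpg?x=1' still yields 'a.jpg'
    filename = os.path.basename(urlparse(absolute_url).path)
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return None
    return absolute_url, filename


site = 'http://pixabay.com'
logo = resolve_image(site, '/static/img/logo.png')
with_query = resolve_image(site, 'img/a.jpg?x=1')
from_cdn = resolve_image(site, '//cdn.pixabay.com/x.gif')
not_an_image = resolve_image(site, '/about.html')
```

    Each non-None result can then be passed to requests.get and written to a file under the returned name.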
    
  • 2020-11-27 19:21

    If you only want the pictures, you can download them without even scraping the webpage. They all share the same URL pattern:

    http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
    http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
    ...
    http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg
    

    So code as simple as this will download all the images:

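    import os
    # Python 3; on Python 2 use: from urllib import urlretrieve
    from urllib.request import urlretrieve


    baseUrl = ("http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"
               "cutest-pics-gallery/cute%s.jpg")

    # the pictures are numbered cute1.jpg through cute10.jpg
    for i in range(1, 11):
        url = baseUrl % i
        # save each picture under its own file name in the current directory
        urlretrieve(url, os.path.basename(url))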
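
    Before downloading anything, it may be worth printing the URLs and target file names the loop will produce. The following is a small dry-run sketch with no network access; the function name numbered_urls is made up for illustration, and the pattern is the one from this answer.

```python
import os

# URL pattern from the answer; %s is replaced by the picture number
base_url = ("http://filmygyan.in/wp-content/gallery/"
            "katrina-kaifs-top-10-cutest-pics-gallery/cute%s.jpg")


def numbered_urls(pattern, first, last):
    """Fill the pattern with every number from first to last, inclusive."""
    return [pattern % i for i in range(first, last + 1)]


# (url, local file name) pairs, in download order
targets = [(url, os.path.basename(url))
           for url in numbered_urls(base_url, 1, 10)]

for url, filename in targets:
    print('{} -> {}'.format(url, filename))
```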
    

    With BeautifulSoup you will have to click through or go to the next page to scrape the images. If you want to scrape each page individually, try to select them using their class, which is shutterset_katrina-kaifs-top-10-cutest-pics-gallery.
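
    Selecting by class could look roughly like the sketch below. To keep it runnable without extra packages it uses the standard library's html.parser instead of BeautifulSoup; with BeautifulSoup the equivalent selection would be soup.find_all(class_=GALLERY_CLASS). The sample markup is made up for illustration, and it is an assumption that the class sits on the link wrapping each thumbnail; the real page may differ.

```python
from html.parser import HTMLParser

GALLERY_CLASS = 'shutterset_katrina-kaifs-top-10-cutest-pics-gallery'


class ClassLinkCollector(HTMLParser):
    """Collect the href (or src) of every tag that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # a class attribute may hold several space-separated class names
        classes = (attributes.get('class') or '').split()
        if self.wanted_class in classes:
            url = attributes.get('href') or attributes.get('src')
            if url:
                self.urls.append(url)


# Hypothetical markup, not copied from the real page.
sample_html = '''
<a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery" href="/gallery/cute1.jpg">
  <img src="/thumbs/cute1.jpg">
</a>
<a class="other" href="/about.html">About</a>
'''

collector = ClassLinkCollector(GALLERY_CLASS)
collector.feed(sample_html)
gallery_urls = collector.urls
```

    Only the link carrying the gallery class is collected; the thumbnail inside it and the unrelated link are skipped.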
