Extract image links from the webpage using Python

Backend | Unresolved | 3 answers | 679 views
走了就别回头了 2021-01-07 00:39

So I wanted to get all of the pictures on this page (of the NBA teams): http://www.cbssports.com/nba/draft/mock-draft

However, my code gives a bit more than that. It

Related tags:
3 Answers
  • 2021-01-07 01:18

    I know this can be "traumatic", but for those automatically generated pages where you just want to grab the images and never come back, a quick-and-dirty regular expression that matches the desired pattern tends to be my choice (not depending on Beautiful Soup is a great advantage):

    import re
    import urllib.request

    # in Python 3, urlopen and urlretrieve live in urllib.request
    source = urllib.request.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read().decode('utf-8')

    ## every image name is an abbreviation composed of capital letters, so...
    ## (the dot is escaped and [A-Z]+ requires at least one letter)
    for link in re.findall(r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]+\.png', source):
        print(link)

        ## the code above just prints the link;
        ## if you want to actually download, set the flag below to True
        actually_download = False
        if actually_download:
            filename = link.split('/')[-1]
            urllib.request.urlretrieve(link, filename)


    Hope this helps!
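As a quick offline sanity check of that approach, the pattern can be exercised against a small invented snippet (the markup below is made up for illustration, not taken from the real page):

```python
import re

# Made-up HTML imitating the page's logo markup (not the real source).
sample = (
    '<img src="http://sports.cbsimg.net/images/nba/logos/30x30/BOS.png">'
    '<img src="http://sports.cbsimg.net/images/nba/logos/30x30/LAL.png">'
    '<img src="http://sports.cbsimg.net/images/other/banner.png">'
)

# Same idea as the answer: team logos are capital-letter abbreviations,
# so lowercase names like banner.png are skipped.
pattern = r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]+\.png'
links = re.findall(pattern, sample)
print(links)
```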

  • 2021-01-07 01:28

    To save all the images on http://www.cbssports.com/nba/draft/mock-draft:

    import os
    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup  # the old BeautifulSoup 3 package is now bs4

    URL = "http://www.cbssports.com/nba/draft/mock-draft"
    default_dir = os.path.join(os.path.expanduser("~"), "Pictures")
    soup = BeautifulSoup(urllib.request.urlopen(URL).read(), "html.parser")
    imgs = soup.find_all("img", {"alt": True, "src": True})
    for img in imgs:
        img_url = urljoin(URL, img["src"])  # resolve relative src paths
        filename = os.path.join(default_dir, img_url.split("/")[-1])
        with open(filename, "wb") as f:
            f.write(urllib.request.urlopen(img_url).read())
    

    To save one particular image from http://www.cbssports.com/nba/draft/mock-draft, use

    soup.find("img",{"src":"image_name_from_source"})
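If BeautifulSoup is not installed, the same lookup can be sketched with only the standard library's html.parser; the markup and the src value "b.png" below are invented for illustration:

```python
from html.parser import HTMLParser

class ImgFinder(HTMLParser):
    """Collect the alt text of <img> tags whose src matches a target."""
    def __init__(self, target_src):
        super().__init__()
        self.target_src = target_src
        self.matches = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and d.get("src") == self.target_src:
            self.matches.append(d.get("alt"))

finder = ImgFinder("b.png")
finder.feed('<img src="a.png" alt="A"><img src="b.png" alt="B">')  # toy markup
print(finder.matches)
```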
    
  • 2021-01-07 01:41

    You can use these functions to get the list of all image URLs from a URL.

    # requires the third-party requests package: pip install requests
    import re
    import requests

    #
    # get_url_images_in_text()
    #
    # @param html - the HTML to extract image URLs from.
    # @param protocol - the site's protocol, prepended to URLs that lack one.
    #
    # @return list of image URLs.
    #
    def get_url_images_in_text(html, protocol):
        urls = []
        all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
        for url in all_urls:
            if not url[0].startswith("http"):
                urls.append(protocol + url[0])
            else:
                urls.append(url[0])

        return urls
    
    #
    # get_images_from_url()
    #
    # @param url - the URL to extract image URLs from.
    #
    # @return list of image URLs.
    #
    def get_images_from_url(url):
        protocol = url.split('/')[0]  # e.g. "http:", prepended to protocol-relative "//" links
        resp = requests.get(url)
        return get_url_images_in_text(resp.text, protocol)
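As a self-contained check of the URL-matching logic above (the regex and prefixing step are reproduced inline so the snippet runs standalone; the HTML is invented):

```python
import re

# Invented markup: one protocol-relative link and one absolute link.
html = (
    '<img src="//cdn.example.com/a.png">'
    '<img src="https://example.com/b.jpg">'
)

# Same pattern as the answer; findall returns (full, scheme, extension) tuples.
found = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html,
                   flags=re.IGNORECASE)
# Prepend a protocol only when the match starts with "//".
urls = [("http:" + u[0]) if not u[0].startswith("http") else u[0]
        for u in found]
print(urls)
```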
    