Download images from google image search (python)

Submitted by 冷暖自知 on 2019-12-22 01:06:42

Question


I am a web scraping beginner. I first followed https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images for a specific tag (e.g. cat), and it works! But I ran into a new problem: I can only download about 100 images. The cause seems to be AJAX: only the first page of HTML is loaded, not the full result set, so it seems we must simulate scrolling down to load the next 100 images or more.

My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing

To sum up, my questions are:

  1. How can I download all the images from a Google image search with Python source code? (Please give me some examples :) )

  2. Are there any web scraping techniques I need to know?


Answer 1:


My final solution is to use icrawler.

# Note: this import path is from an older icrawler release;
# newer versions expose the crawler as icrawler.builtin (see Answer 4).
from icrawler.examples import GoogleImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200,200), max_size=None)

The advantage of this framework is that it contains five built-in crawlers (Google, Bing, Baidu, Flickr, and a general crawler), but it still provides only about 100 images when crawling from Google.




Answer 2:


Use the Google API to get results, so replace your URL with something like this:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8&start=0

You will get 8 results; then call the service again with start=8 to get the next ones, and so on, until you receive an error.
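The paging scheme above (fixed page size, advancing start offset) can be sketched without hitting the network; this is a minimal illustration of generating the successive start values, not a call to the (long since deprecated) service itself:

```python
def page_offsets(page_size=8, max_results=64):
    """Yield successive `start` values for a paged API that
    returns `page_size` results per request."""
    for start in range(0, max_results, page_size):
        yield start

# With rsz=8 this produces start=0, 8, 16, ..., 56.
print(list(page_offsets()))
```

In a real crawler the loop would stop when the API returns an error or an empty result page rather than at a fixed maximum.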

The returned data is in JSON format.

Here is a Python example I found on the web:

# Python 2 example (urllib2 and simplejson are Python 2 era modules)
import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
       'v=1.0&q=barack%20obama&userip=INSERT-USER-IP')

# Replace the placeholder Referer with the URL of your own site.
request = urllib2.Request(url, None, {'Referer': 'http://example.com/'})
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
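To show what "having fun with the results" might look like, here is a sketch of pulling image URLs out of such a payload. Since the service has been shut down, it parses a hard-coded sample instead of a live response, and the field names (responseData, results, unescapedUrl) are assumptions based on the old API's response shape:

```python
import json

# Sample payload mimicking the shape of the deprecated
# Google Image Search API; field names are assumptions.
sample = json.dumps({
    "responseData": {
        "results": [
            {"unescapedUrl": "http://example.com/cat1.jpg", "width": "800"},
            {"unescapedUrl": "http://example.com/cat2.jpg", "width": "640"},
        ],
    },
    "responseStatus": 200,
})

data = json.loads(sample)
# Collect the direct image URLs from each result entry.
urls = [r["unescapedUrl"] for r in data["responseData"]["results"]]
print(urls)
```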

As for web scraping techniques, there is this page: http://jakeaustwick.me/python-web-scraping-resource

Hope it helps.




Answer 3:


To get 100 results, try this:

# Python 2 code; in Python 3 these moved to urllib.request and urllib.parse
from urllib import FancyURLopener
import re
import posixpath
import urlparse

class MyOpener(FancyURLopener, object):
    # Spoof a mobile browser user agent so Google returns usable HTML.
    version = "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"

myopener = MyOpener()

# tbm=isch selects image search; tbs=isz:l restricts results to large images.
page = myopener.open('https://www.google.pt/search?q=love&biw=1600&bih=727&source=lnms&tbm=isch&sa=X&tbs=isz:l&tbm=isch')
html = page.read()

# Each result anchor wraps an /imgres?imgurl=<actual image URL>&amp;imgrefurl=... link.
for match in re.finditer(r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&amp;imgrefurl', html, re.IGNORECASE | re.DOTALL | re.MULTILINE):
    path = urlparse.urlsplit(match.group(1)).path
    filename = posixpath.basename(path)
    myopener.retrieve(match.group(1), filename)

You can tweak biw=1600&bih=727 to get bigger or smaller images.
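The URL-to-filename step in the snippet above works the same way in Python 3, where urlparse became urllib.parse; a minimal sketch of just that step, using a made-up example URL:

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path component of a URL,
    e.g. the image file name to save to disk."""
    return posixpath.basename(urlsplit(url).path)

# Hypothetical example URL, for illustration only.
print(filename_from_url("http://example.com/images/2014/cat.jpg"))
```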




Answer 4:


For any questions about icrawler, you can raise an issue on GitHub, which may get a faster response.

The limit on the number of Google search results seems to be 1000. A workaround is to define a date range like the following.

from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2014, 1, 1),
    date_max=date(2015, 1, 1))
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2015, 1, 1),
    date_max=date(2016, 1, 1))
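The two crawl() calls above can be generalized by generating consecutive one-year (date_min, date_max) windows and looping over them; this sketch only builds the window pairs (matching the answer's own ranges, where adjacent windows share a boundary date) and does not call the crawler:

```python
from datetime import date

def yearly_windows(start_year, end_year):
    """Yield (date_min, date_max) pairs covering one year each,
    with adjacent windows sharing a boundary date."""
    for year in range(start_year, end_year):
        yield date(year, 1, 1), date(year + 1, 1, 1)

windows = list(yearly_windows(2014, 2016))
print(windows)
# Each pair could then be passed to google_crawler.crawl(...)
# as date_min/date_max in a loop.
```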


Source: https://stackoverflow.com/questions/25133865/download-images-from-google-image-search-python
