Download images from google image search (python)

老子叫甜甜 提交于 2019-12-04 21:43:07

My final solution is using icrawler.

from icrawler.examples import GoogleImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200,200), max_size=None)

The advantage the framework contains 5 built-in crawler (google, bing, baidu, flicker and general crawl), but it still only provide 100 images when crawl from google.

Use Google API to get results, so replace your URL by something like this:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8&start=0

You will get 8 results, then call again the service with start=7 to get the next ones etc. until you receive an error.

The returned data is in JSON format.

Here is a Python example I found on the web:

import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
       'v=1.0&q=barack%20obama&userip=INSERT-USER-IP')

request = urllib2.Request(url, None, {'Referer': /* Enter the URL of your site here */})
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...

As for web scrapping techniques there is this page: http://jakeaustwick.me/python-web-scraping-resource

Hope it helps.

To get 100 results, try this:

from urllib import FancyURLopener
import re
import posixpath
import urlparse 

class MyOpener(FancyURLopener, object):
    version = "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"

myopener = MyOpener()

page = myopener.open('https://www.google.pt/search?q=love&biw=1600&bih=727&source=lnms&tbm=isch&sa=X&tbs=isz:l&tbm=isch')
html = page.read()

for match in re.finditer(r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&amp;imgrefurl', html, re.IGNORECASE | re.DOTALL | re.MULTILINE):
    path = urlparse.urlsplit(match.group(1)).path
    filename = posixpath.basename(path)
    myopener.retrieve(match.group(1), filename)

I can tweak biw=1600&bih=727 to get bigger or smaller images.

For any questions about icrawler, you can raise an issue on Github, which may get faster response.

The number limit for google search results seems to be 1000. A work around is to define a date range like the following.

from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2014, 1, 1),
    date_max=date(2015, 1, 1))
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2015, 1, 1),
    date_max=date(2016, 1, 1))
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!