Download from EXPLOSM.net Comics Script [Python]


Question


So I wrote this short script (is that the correct word?) to download the comic images from explosm.net because I somewhat recently found out about it and I want to... put it on my iPhone... 3G.

It works fine and all: urllib2 for fetching the page HTML and urllib for retrieving the image (URLopener().retrieve()).

Why I posted this on SO: how do I optimize this code? Would regular expressions make it faster? Is it limited by my internet connection? A poor algorithm...?

Any improvements in speed or general code aesthetics would be greatly appreciated as answers.

Thank you.

--------------------------------CODE----------------------------------

import urllib, urllib2

def LinkConvert(string_link):
    # Percent-encode any spaces in the image URL
    # (equivalent to replacing " " with "%20")
    for eachLetter in string_link:
        if eachLetter == " ":
            string_link = string_link[:string_link.find(eachLetter)] + "%20" + string_link[string_link.find(eachLetter)+1:]
    return string_link

start = 82    # first comic number to fetch
end = 1506    # last comic number (unused below; the loop only grabs 7 pages)

# HTML snippets that bracket the comic <img> tag on each page
matchingStart = """<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/"""
matchingEnd = """></"""
link = "http://www.explosm.net/comics/"

for pageNum in range(start, start+7):
    # Fetch the comic page HTML
    req = urllib2.Request(link + `pageNum`)
    response = urllib2.urlopen(req)
    page = response.read()

    # Slice out the comic's <img ...> tag
    istart1 = page.find(matchingStart)
    iend1 = page.find(matchingEnd, istart1)
    newString1 = page[istart1 : iend1]

    # Pull the quoted URL out of the src="..." attribute
    istart2 = newString1.find("src=") + 4
    iend2 = len(newString1)
    final = newString1[istart2 + 1 : iend2 - 1]

    # Download the image as <pageNum>.jpg
    final = LinkConvert(final)
    try:
        image = urllib.URLopener()
        image.retrieve(final, `pageNum` + ".jpg")
    except:
        print "Uh-oh! " + `pageNum` + " was not downloaded!"

    print `pageNum` + " completed..."

By the way, this is Python 2.5 code, not 3.0, but you can bet I'll have all the features of Python 3.0 thoroughly studied and played around with before or right after New Year (after college apps - YAY! ^-^)


Answer 1:


I would suggest using Scrapy for your page fetching and Beautiful Soup for the parsing. This would make your code a lot simpler.

Whether you want to switch your existing, working code over to these alternatives is up to you. If not, regular expressions would probably simplify your code somewhat. I'm not sure what effect they would have on performance.
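For instance, the string-slicing could become a single search. Here is a minimal sketch of the regex approach, assuming the comic <img> tag still carries the src attribute your code matches on:

import re
import urllib2

# Matches the comic image URL inside the src="..." attribute
# (pattern is an assumption based on the markup quoted in the question)
img_pattern = re.compile(r'src="(http://www\.explosm\.net/db/files/Comics/[^"]+)"')

page = urllib2.urlopen("http://www.explosm.net/comics/82/").read()
match = img_pattern.search(page)
if match:
    print match.group(1)  # the comic image URL, ready to hand to urllib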




Answer 2:


refactormycode may be a more appropriate website for these "let's improve this code" types of discussion.




Answer 3:


I suggest using BeautifulSoup to do the parsing; it would simplify your code a lot.

But since you already got it working this way, maybe you won't want to touch it until it breaks (when the page format changes).
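A rough sketch of what that could look like with BeautifulSoup 3 (the release that works on Python 2.5); the lookup assumes the page still uses the alt text quoted in your code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

page = urllib2.urlopen("http://www.explosm.net/comics/82/").read()
soup = BeautifulSoup(page)
# Locate the comic image by its alt text (assumption: unchanged page markup)
img = soup.find("img", alt="Cyanide and Happiness, a daily webcomic")
if img is not None:
    urllib.urlretrieve(img["src"], "82.jpg")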




Answer 4:


urllib2 uses blocking calls, and that's the main reason for the poor performance. You should use a non-blocking library (like Scrapy) or use multiple threads for the retrieval. I have never used Scrapy (so I can't speak to that option), but threading in Python is really easy and straightforward.
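A minimal sketch of the threaded version; grab_comic here is just a stand-in for the body of the question's for-loop (fetch the page, find the image URL, download it), not a real helper from any library:

import threading

def grab_comic(pageNum):
    # Placeholder: fetch http://www.explosm.net/comics/<pageNum>/,
    # extract the image URL and save it, exactly as in the question's loop
    pass

threads = []
for pageNum in range(82, 89):
    t = threading.Thread(target=grab_comic, args=(pageNum,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait for all downloads to finish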




Answer 5:


I did the same today using Bash. It's really basic, but it worked fine.

I first created two directories where I put the files:

mkdir -p html/archived
mkdir png

Then I worked in two steps. First, fetch all the pages:

START=15
END=4783
for ((i=START;i<=END;i++)); do
  echo $i
  wget http://explosm.net/comics/$i/ -O html/$i.html
done

#Remove 404
find html -name '*.html' -size 0 -print0 | xargs -0 rm

Second, for each page, scrape the HTML and retrieve the picture:

#!/bin/bash
# For each saved page, pull the comic URL out of its og:image <meta> tag
for filename in ./html/*.html; do
  # Derive the comic number from the filename (./html/15.html -> 15)
  i=`echo $filename | cut -d '"' -f 4 | cut -d '/' -f3 | cut -d '.' -f1`
  echo "$filename => $i"
  wget -c "$(grep '<meta property="og:image" content=' ${filename} | cut -d '"' -f 4)" -O ./png/${i}.png
  mv $filename ./html/archived/
done

Result is here: Cyanide_and_happiness__up_to_2017-11-24.zip

Note that I didn't care much about potential failures, but counting 4606 files, it seems mostly OK.

I also saved everything as .png. They are probably .jpg, and I noticed 185 zero-sized files, but... feel free to care about that, I just won't :)
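For completeness, a rough Python equivalent of the og:image trick used above (it assumes the comic pages still expose an og:image meta tag, which is what the Bash script relies on):

import re
import urllib
import urllib2

pageNum = 15
html = urllib2.urlopen("http://explosm.net/comics/%d/" % pageNum).read()
# Pull the image URL out of the og:image <meta> tag
match = re.search(r'<meta property="og:image" content="([^"]+)"', html)
if match:
    urllib.urlretrieve(match.group(1), "%d.png" % pageNum)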



Source: https://stackoverflow.com/questions/394978/download-from-explosm-net-comics-script-python
