Question
bookmarks.html looks like this:
<DT><A HREF="http://www.youtube.com/watch?v=Gg81zi0pheg" ADD_DATE="1320876124" LAST_MODIFIED="1320878745" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=Gg81zi0pheg</A>
<DT><A HREF="http://www.youtube.com/watch?v=pP9VjGmmhfo" ADD_DATE="1320876156" LAST_MODIFIED="1320878756" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=pP9VjGmmhfo</A>
<DT><A HREF="http://www.youtube.com/watch?v=yTA1u6D1fyE" ADD_DATE="1320876163" LAST_MODIFIED="1320878762" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=yTA1u6D1fyE</A>
<DT><A HREF="http://www.youtube.com/watch?v=4v8HvQf4fgE" ADD_DATE="1320876186" LAST_MODIFIED="1320878767" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=4v8HvQf4fgE</A>
<DT><A HREF="http://www.youtube.com/watch?v=e9zG20wQQ1U" ADD_DATE="1320876195" LAST_MODIFIED="1320878773" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=e9zG20wQQ1U</A>
<DT><A HREF="http://www.youtube.com/watch?v=khL4s2bvn-8" ADD_DATE="1320876203" LAST_MODIFIED="1320878782" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="">http://www.youtube.com/watch?v=khL4s2bvn-8</A>
<DT><A HREF="http://www.youtube.com/watch?v=XTndQ7bYV0A" ADD_DATE="1320876271" LAST_MODIFIED="1320876271">Paramore - Walmart Soundcheck 6-For a pessimist(HQ)</A>
<DT><A HREF="http://www.youtube.com/watch?v=xTT2MqgWRRc" ADD_DATE="1320876284" LAST_MODIFIED="1320876284">Paramore - Walmart Soundcheck 5-Pressure(HQ)</A>
<DT><A HREF="http://www.youtube.com/watch?v=J2ZYQngwSUw" ADD_DATE="1320876291" LAST_MODIFIED="1320876291">Paramore - Wal-Mart Soundcheck Interview</A>
<DT><A HREF="http://www.youtube.com/watch?v=9RZwvg7unrU" ADD_DATE="1320878207" LAST_MODIFIED="1320878207">Paramore - 08 - Interview [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=vz3qOYWwm10" ADD_DATE="1320878295" LAST_MODIFIED="1320878295">Paramore - 04 - That's What You Get [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=yarv52QX_Yw" ADD_DATE="1320878301" LAST_MODIFIED="1320878301">Paramore - 05 - Pressure [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=LRREY1H3GCI" ADD_DATE="1320878317" LAST_MODIFIED="1320878317">Paramore - Walmart Promo</A>
It's a standard bookmarks export file from Firefox.
I feed it into bookmarks.py which looks like this:
#!/usr/bin/env python
import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup
url_list = sys.argv[1]
urls = [tag['href'] for tag in
BeautifulSoup(open(url_list)).findAll('a')]
print urls
This prints a much cleaner list of URLs:
[u'http://www.youtube.com/watch?v=Gg81zi0pheg', u'http://www.youtube.com/watch?v=pP9VjGmmhfo', u'http://www.youtube.com/watch?v=yTA1u6D1fyE', u'http://www.youtube.com/watch?v=4v8HvQf4fgE', u'http://www.youtube.com/watch?v=e9zG20wQQ1U', u'http://www.youtube.com/watch?v=khL4s2bvn-8', u'http://www.youtube.com/watch?v=XTndQ7bYV0A', u'http://www.youtube.com/watch?v=xTT2MqgWRRc', u'http://www.youtube.com/watch?v=J2ZYQngwSUw', u'http://www.youtube.com/watch?v=9RZwvg7unrU', u'http://www.youtube.com/watch?v=vz3qOYWwm10', u'http://www.youtube.com/watch?v=yarv52QX_Yw', u'http://www.youtube.com/watch?v=LRREY1H3GCI']
My next step is to pass each of the YouTube URLs to video_info.py:
#!/usr/bin/python
import urlparse
import sys
import gdata.youtube
import gdata.youtube.service
import re
import urlparse
import urllib2
youtube_url = sys.argv[1]
url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]
print youtube_id
yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'
entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)
print 'Video title: %s' % entry.media.title.text
print 'Video view count: %s' % entry.statistics.view_count
When I pass it this URL, "http://www.youtube.com/watch?v=aXrgwC1rsw4", the output looks like this:
aXrgwC1rsw4
Video title: OneRepublic Good Life Live Walmart Soundcheck
Video view count: 202
How do I feed the list of urls from bookmarks.py into video_info.py?
*Extra points for output in CSV format, and extra extra points for checking for duplicates in bookmarks.html before passing data to video_info.py.*
Thanks for all your help, guys. Because of Stack Overflow I've gotten this far.
David
So combined I now have:
#!/usr/bin/env python
import urlparse
import gdata.youtube
import gdata.youtube.service
import re
import urlparse
import urllib2
import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup
yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'
url_list = sys.argv[1]
urls = [tag['href'] for tag in
BeautifulSoup(open(url_list)).findAll('a')]
print urls
youtube_url = urls
url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]
#list(set(my_list))
entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)
myyoutubes = []
myyoutubes.append(", ".join([youtube_id, entry.media.title.text,entry.statistics.view_count]))
print "\n".join(myyoutubes)
How do I pass the list of URLs to the youtube_url variable? I believe they need to be cleaned up further and passed one at a time.
I've got it down to this now:
#!/usr/bin/env python
import urlparse
import gdata.youtube
import gdata.youtube.service
import re
import urlparse
import urllib2
import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup
yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'
url_list = sys.argv[1]
urls = [tag['href'] for tag in
BeautifulSoup(open(url_list)).findAll('a')]
for url in urls:
youtube_url = url
url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]
#list(set(my_list))
entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)
myyoutubes = []
myyoutubes.append(", ".join([youtube_id, entry.media.title.text,entry.statistics.view_count]))
print "\n".join(myyoutubes)
I can pass bookmarks.html to combined.py, but it only returns the first line.
How do I loop through each line of youtube_url?
Answer 1:
You should provide a string to BeautifulSoup:
# parse bookmarks.html
with open(sys.argv[1]) as bookmark_file:
    soup = BeautifulSoup(bookmark_file.read())
# extract youtube video urls
video_url_regex = re.compile('http://www.youtube.com/watch')
urls = [link['href'] for link in soup('a', href=video_url_regex)]
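For readers who do not have BeautifulSoup 3 installed, the same link extraction can be approximated with the standard library alone. This is only a rough Python 3 sketch, not the answer's method: `LinkCollector` and `extract_links` are hypothetical names, the sample markup is trimmed from the question, and unlike the regex filter above it returns every link, not just YouTube ones.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html_text):
    """Return the href values of all links found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Two entries trimmed from the bookmarks.html shown in the question.
sample = (
    '<DT><A HREF="http://www.youtube.com/watch?v=Gg81zi0pheg" ADD_DATE="1320876124">one</A>\n'
    '<DT><A HREF="http://www.youtube.com/watch?v=LRREY1H3GCI" ADD_DATE="1320878317">two</A>\n'
)
print(extract_links(sample))
```

The unclosed `<DT>` tags in Firefox exports are not a problem here, because the parser only reacts to start tags.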
Separate the very fast URL parsing from the much slower downloading of the stats:
# extract video ids from the urls
ids = [] # you could use `set()` and `ids.add()` to avoid duplicates
for video_url in urls:
    url = urlparse.urlparse(video_url)
    video_id = urlparse.parse_qs(url.query).get('v')
    if not video_id: continue # no video_id in the url
    ids.append(video_id[0])
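The `set()` idea mentioned in the comment can be combined with the list so that duplicates are dropped while the bookmark order is kept. The following is a sketch under assumptions: it is written for Python 3 (where `urlparse` lives in `urllib.parse`), `extract_video_ids` is a hypothetical name, and the `&feature=related` URL is a made-up duplicate added for the demo.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_ids(urls):
    """Return the `v` query parameter of each URL, without duplicates, in order."""
    seen = set()
    ids = []
    for video_url in urls:
        values = parse_qs(urlparse(video_url).query).get("v")
        if not values:
            continue  # no video id in this URL
        video_id = values[0]
        if video_id not in seen:  # skip ids that were already collected
            seen.add(video_id)
            ids.append(video_id)
    return ids


urls = [
    "http://www.youtube.com/watch?v=Gg81zi0pheg",
    "http://www.youtube.com/watch?v=pP9VjGmmhfo",
    "http://www.youtube.com/watch?v=Gg81zi0pheg&feature=related",  # invented duplicate
    "http://www.youtube.com/",  # no video id at all
]
print(extract_video_ids(urls))
```

Deduplicating on the video id rather than on the raw URL catches the case where the same video was bookmarked with different extra query parameters.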
You don't need to authenticate for readonly requests:
# get some statistics for the videos
yt_service = YouTubeService()
yt_service.ssl = True #NOTE: it works for readonly requests
yt_service.debug = True # show requests
Save the statistics to a CSV file named on the command line, and don't stop if a video causes an error:
writer = csv.writer(open(sys.argv[2], 'wb')) # save to csv file
for video_id in ids:
    try:
        entry = yt_service.GetYouTubeVideoEntry(video_id=video_id)
    except Exception, e:
        print >>sys.stderr, "Failed to retrieve entry video_id=%s: %s" %(
            video_id, e)
    else:
        title = entry.media.title.text
        print "Title:", title
        view_count = entry.statistics.view_count
        print "View count:", view_count
        writer.writerow((video_id, title, view_count)) # write it
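The try/except/else pattern above can be tried out without network access by passing the lookup in as a function. This is a minimal Python 3 sketch, assuming a hypothetical `write_stats` helper and a `fake_fetch` stand-in that returns invented data instead of calling the YouTube service.

```python
import csv
import io
import sys


def write_stats(video_ids, fetch, out_file):
    """Write one CSV row per video; report failures on stderr and keep going."""
    writer = csv.writer(out_file)
    failed = []
    for video_id in video_ids:
        try:
            title, view_count = fetch(video_id)
        except Exception as exc:
            print("Failed to retrieve video_id=%s: %s" % (video_id, exc),
                  file=sys.stderr)
            failed.append(video_id)
        else:
            # Only reached when fetch() succeeded for this id.
            writer.writerow((video_id, title, view_count))
    return failed


def fake_fetch(video_id):
    # Stand-in for the real API call; the title and count are invented.
    if video_id == "bad":
        raise ValueError("not found")
    return "Some title", 42


buffer = io.StringIO()
failed = write_stats(["abc", "bad", "xyz"], fake_fetch, buffer)
print(buffer.getvalue())
print("failed:", failed)
```

When writing to a real file in Python 3, open it with `open(path, "w", newline="")` rather than the `'wb'` mode used in the Python 2 code above.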
Here's a full script; press playback to watch how it was written.
Output
$ python download-video-stats.py neudorfer.html out.csv
send: u'GET https://gdata.youtube.com/feeds/api/videos/Gg81zi0pheg HTTP/1.1\r\nAcc
ept-Encoding: identity\r\nHost: gdata.youtube.com\r\nContent-Type: application/ato
m+xml\r\nUser-Agent: None GData-Python/2.0.15\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: X-GData-User-Country: RU
header: Content-Type: application/atom+xml; charset=UTF-8
header: Expires: Thu, 10 Nov 2011 19:31:23 GMT
header: Date: Thu, 10 Nov 2011 19:31:23 GMT
header: Cache-Control: private, max-age=300, no-transform
header: Vary: *
header: GData-Version: 1.0
header: Last-Modified: Wed, 02 Nov 2011 08:58:11 GMT
header: Transfer-Encoding: chunked
header: X-Content-Type-Options: nosniff
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1; mode=block
header: Server: GSE
Title: Paramore - Let The Flames Begin [Wal-Mart Soundcheck]
View count: 27807
out.csv
Gg81zi0pheg,Paramore - Let The Flames Begin [Wal-Mart Soundcheck],27807
pP9VjGmmhfo,Paramore: Wal-Mart Soundcheck,1363078
yTA1u6D1fyE,Paramore-Walmart Soundcheck 7-CrushCrushCrush(HQ),843
4v8HvQf4fgE,Paramore-Walmart Soundcheck 4-That's What You Get(HQ),1429
e9zG20wQQ1U,Paramore-Walmart Soundcheck 8-Interview(HQ),1306
khL4s2bvn-8,Paramore-Walmart Soundcheck 3-Emergency(HQ),796
XTndQ7bYV0A,Paramore-Walmart Soundcheck 6-For a pessimist(HQ),599
xTT2MqgWRRc,Paramore-Walmart Soundcheck 5-Pressure(HQ),963
J2ZYQngwSUw,Paramore - Wal-Mart Soundcheck Interview,10261
9RZwvg7unrU,Paramore - 08 - Interview [Wal-Mart Soundcheck],1674
vz3qOYWwm10,Paramore - 04 - That's What You Get [Wal-Mart Soundcheck],1268
yarv52QX_Yw,Paramore - 05 - Pressure [Wal-Mart Soundcheck],1296
LRREY1H3GCI,Paramore - Walmart Promo,523
Answer 2:
Why not just combine the two files? Also, you may want to break it up into methods to make it easier to understand later.
Also, for csv you're going to want to accumulate your data. So, maybe have a list and each time through append a csv line of youtube info:
myyoutubes = []
...
myyoutubes.append(", ".join([youtubeid, entry.media.title.text,entry.statistics.view_count]))
...
"\n".join(myyoutubes)
For duplicates, I normally do this: list(set(my_list)). Sets only contain unique elements, though note that the original order of the list is not preserved.
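One caveat about accumulating comma-joined lines as suggested in this answer: a title that itself contains a comma produces an ambiguous row. The sketch below (Python 3, with an invented title) compares the naive join with the standard `csv` module, which quotes such fields.

```python
import csv
import io

# The id comes from the question; the title is invented to contain a comma.
row = ["vz3qOYWwm10", "Title, with a comma", "1268"]

# Naive join: the comma inside the title yields four fields instead of three.
naive_line = ", ".join(row)
print(naive_line.split(", "))

# csv.writer quotes the title, so reading it back gives the original fields.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue().strip())

parsed = next(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed == row)
```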
Source: https://stackoverflow.com/questions/8083268/youtube-video-id-from-firefox-bookmark-html-source-code-almost-there