Question
I am trying to download the video metadata for the year 2019, but every time I run my code it exceeds the quota limit. I have fewer than 100 videos for that period. Can anyone show me a better way to write the code?
try:
    request = youtube.search().list(
        part = 'id, snippet',
        type = 'video',
        publishedAfter = '2018-12-31T23:59:59Z',
        publishedBefore = '2020-01-01T00:00:00Z',
        order = 'date',
        fields = 'nextPageToken,items(id,snippet)',
        pageToken = None,
        maxResults = 50
    )
    response = request.execute()
    nextPageToken = None
    while True:
        request = youtube.search().list(
            pageToken = nextPageToken,
            part = 'id, snippet',
            type = 'video',
            fields = 'nextPageToken,items(id,snippet)',
            maxResults = 50
        )
        response = request.execute()
        nextPageToken = response['nextPageToken']
        items = response['items']
        if response['nextPageToken'] == None:
            break
        for each_item in items:
            video_id = each_item['id']['videoId']
            sub_items = each_item['snippet']
            for sub_item in sub_items:
                video_item[sub_item] = sub_items[sub_item ]
            video_data[video_id] = video_item
except Exception as e:
    print('Error in get_video_data: {0}'.format(e))
Thanks!
Answer 1:
Please note that your API call to the Search.list endpoint runs against the whole set of YouTube videos published in that one-year period; it doesn't specify any other filtering criteria, which means that your query (upon pagination) could potentially return millions of video entries.
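To see why the quota runs out so quickly, here is a rough back-of-the-envelope calculation. The figures are assumptions taken from the YouTube Data API quota documentation rather than from the question (a default allocation of 10,000 units per day and a cost of 100 units per Search.list call), so check your own project's quota before relying on them.

```python
# Assumed figures (from the YouTube Data API quota documentation; verify for your project).
DAILY_QUOTA_UNITS = 10_000   # assumed default daily allocation
SEARCH_LIST_COST = 100       # assumed cost of one Search.list call
MAX_RESULTS_PER_PAGE = 50    # the maxResults value used in the question


def max_search_calls(daily_quota=DAILY_QUOTA_UNITS, cost=SEARCH_LIST_COST):
    """Number of Search.list calls that fit into one day's quota."""
    return daily_quota // cost


def max_videos_per_day(daily_quota=DAILY_QUOTA_UNITS,
                       cost=SEARCH_LIST_COST,
                       page_size=MAX_RESULTS_PER_PAGE):
    """Upper bound on the number of search results retrievable in one day."""
    return max_search_calls(daily_quota, cost) * page_size


print(max_search_calls())    # 100
print(max_videos_per_day())  # 5000
```

Under these assumptions, a query that keeps paginating stops after 100 calls, whereas fewer than 100 videos fit into two pages of 50 results, that is, two calls.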
If in fact you're looking for your own videos, then your Search.list endpoint call should include either the forMine or the channelId request parameter:
- when you've constructed your youtube object from the discovery.build method using its parameter credentials (that is, you're issuing an authorized request), then use the request parameter forMine as shown below:
request = youtube.search().list(
    forMine = True,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)
Do note that this alternative proved not to be viable, as per the findings documented below under the section Update and fixes.
- when you've constructed your youtube object from the discovery.build method using its parameter developerKey (that is, you're not issuing an authorized request), then use the request parameter channelId as shown below:
request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)
Note that CHANNEL_ID is the ID of your channel (or any other channel for that matter).
The difference between the two kinds of API calls above is the following: when issuing an authorized request (the first bullet above), you'll get all videos of your channel, including those that are non-public (i.e. those that have their privacyStatus set to private or unlisted); on the other hand, when using an API key (the second bullet above), you'll get only the public videos (i.e. those that have their privacyStatus set to public), even if CHANNEL_ID is the ID of your own channel.
Now, unfortunately, your code above has another issue: your two Search.list endpoint calls are not identical modulo the pageToken request parameter. That's because the second call does not receive the request parameters publishedAfter and publishedBefore.
This difference implies that you're not correctly paginating the result set of your first API call (even though you pass the parameter pageToken to the second API call).
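To illustrate what "identical modulo pageToken" means in practice, here is a minimal sketch of manual pagination that reuses one set of parameters for every page. The helper name fetch_all_items and the FakeSearch and FakeRequest stand-ins are hypothetical, introduced only so that the example runs without network access; with the real client you would pass youtube.search() instead of FakeSearch().

```python
def fetch_all_items(search_resource, **params):
    """Collect items from every page, reusing the same parameters for each call.

    `search_resource` is expected to behave like `youtube.search()`:
    `.list(**kwargs)` returns an object whose `.execute()` returns a dict.
    """
    items = []
    page_token = None
    while True:
        # Only pageToken changes between calls; all filters stay the same.
        request = search_resource.list(pageToken=page_token, **params)
        response = request.execute()
        # Process the current page before deciding whether to stop.
        items.extend(response.get('items', []))
        # The last page carries no nextPageToken, so .get() avoids a KeyError.
        page_token = response.get('nextPageToken')
        if page_token is None:
            break
    return items


class FakeRequest:
    """Hypothetical stand-in for a request object; returns a canned response."""
    def __init__(self, response):
        self._response = response

    def execute(self):
        return self._response


class FakeSearch:
    """Hypothetical stand-in for youtube.search(), serving two canned pages."""
    PAGES = {
        None: {'items': [{'id': {'videoId': 'a'}}], 'nextPageToken': 'p2'},
        'p2': {'items': [{'id': {'videoId': 'b'}}]},
    }

    def __init__(self):
        self.calls = []  # records the non-pageToken parameters of each call

    def list(self, pageToken=None, **params):
        self.calls.append(params)
        return FakeRequest(self.PAGES[pageToken])


search = FakeSearch()
items = fetch_all_items(search, part='id,snippet', type='video', maxResults=50)
video_ids = [item['id']['videoId'] for item in items]
print(video_ids)
```

Note that the items of the last page are collected before the loop exits, and that the absence of nextPageToken is what signals the end of the result set.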
Fortunately, the Google API Client Library for Python that you're using implements API result set pagination in a simple, Pythonic way (below I'll exemplify the case of the second bullet above):
request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)
video_data = {}
while request:
    response = request.execute()
    for item in response['items']:
        video_id = item['id']['videoId']
        video_item = item['snippet']
        video_data[video_id] = video_item
    request = youtube.search().list_next(
        request, response)
The code above shows that it is not necessary to repeat the first API call in its entirety with an added pageToken parameter; it suffices to have the simpler statement:
request = youtube.search().list_next(
    request, response)
This statement uses the value of the nextPageToken property of the response object to construct, from the old request object, a new one with a properly set pageToken property. When the response has no nextPageToken, list_next returns None, which ends the while loop.
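Conceptually, list_next behaves roughly like the function below. This is a simplified model written for illustration, not the library's actual implementation: it represents a request as a plain URL string, and the name next_request_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_request_url(previous_url, previous_response):
    """Return the URL for the next page, or None when there are no more pages."""
    token = previous_response.get('nextPageToken')
    if not token:
        return None
    parts = urlsplit(previous_url)
    # Keep every existing query parameter, replacing only pageToken.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != 'pageToken']
    query.append(('pageToken', token))
    return urlunsplit(parts._replace(query=urlencode(query)))


url = ('https://www.googleapis.com/youtube/v3/search'
       '?part=id%2Csnippet&type=video&maxResults=50')
second = next_request_url(url, {'items': [], 'nextPageToken': 'CDIQAA'})
print(second)
print(next_request_url(second, {'items': []}))  # None: no further pages
```

The point of the model is that all filters of the original request are carried over unchanged, which is exactly what the hand-written second call in the question failed to do.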
Update and fixes
Upon further tests and investigation of invoking Search.list with the request parameters forMine, publishedAfter and publishedBefore as above, I came to the following conclusion:
- the parameter forMine=True given without either of the parameters publishedAfter and publishedBefore makes the API call work as expected;
- the parameter forMine=True given along with either of the parameters publishedAfter and publishedBefore, or with both, produces the HTTP error 400 Bad Request along with the JSON error response:
{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "errors": [
      {
        "message": "Request contains an invalid argument.",
        "domain": "global",
        "reason": "badRequest"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}
Google's own issue tracker records a very recent bug report that describes precisely the behavior above. The official response from Google's staff was the following:
Status: Won't Fix (Intended Behavior)
This is working as intended. Basically you can only set one of the resource filters if it's a for_content_owner request, but both channel ID and published after are resource filters. This requirement doesn't seem to be specified on the developer website: https://developers.google.com/youtube/v3/docs/search/list.
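Given that restriction, one possible workaround (not mentioned in the bug report) is to call Search.list with forMine=True and no date parameters, then filter the returned items by snippet.publishedAt on the client side. Below is a minimal sketch; the helper names and the sample items are made up for illustration, and the comparison here is strict on both bounds, which may or may not match how the API itself treats publishedAfter and publishedBefore.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an RFC 3339 UTC timestamp such as '2019-05-01T12:00:00Z'."""
    return datetime.fromisoformat(value.replace('Z', '+00:00'))


def published_between(items, after, before):
    """Keep items whose snippet.publishedAt lies strictly between the bounds."""
    start, end = parse_timestamp(after), parse_timestamp(before)
    return [
        item for item in items
        if start < parse_timestamp(item['snippet']['publishedAt']) < end
    ]


# Made-up sample items shaped like Search.list results.
items = [
    {'id': {'videoId': 'old'}, 'snippet': {'publishedAt': '2018-06-01T08:00:00Z'}},
    {'id': {'videoId': 'hit'}, 'snippet': {'publishedAt': '2019-06-01T08:00:00Z'}},
    {'id': {'videoId': 'new'}, 'snippet': {'publishedAt': '2020-06-01T08:00:00Z'}},
]
kept = published_between(items, '2018-12-31T23:59:59Z', '2020-01-01T00:00:00Z')
kept_ids = [item['id']['videoId'] for item in kept]
print(kept_ids)
```

This trades a few extra pages of results for the ability to see non-public videos, which is reasonable only when the channel is small.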
Source: https://stackoverflow.com/questions/65443193/youtube-data-api-page-token-question-python