YouTube Data API Page Token Question (Python)

Question

I am trying to download video metadata for the year 2019. My code exceeds the quota limit every time I run it, even though I have fewer than 100 videos for that period. Can anyone show me a better way to write this code?

try:
    request = youtube.search().list(
        part = 'id, snippet',
        type = 'video',
        publishedAfter = '2018-12-31T23:59:59Z',
        publishedBefore = '2020-01-01T00:00:00Z',
        order = 'date',
        fields = 'nextPageToken,items(id,snippet)',
        pageToken = None,
        maxResults = 50
    )
    response = request.execute()
    nextPageToken = None

    while True:
        request = youtube.search().list(
            pageToken = nextPageToken,
            part = 'id, snippet',
            type = 'video',
            fields = 'nextPageToken,items(id,snippet)',
            maxResults = 50
        )

        response = request.execute()
        nextPageToken = response['nextPageToken']
        items = response['items']
        if response['nextPageToken'] == None:
            break
        for each_item in items:
            video_id = each_item['id']['videoId']
            sub_items = each_item['snippet']
            for sub_item in sub_items:
                video_item[sub_item] = sub_items[sub_item]

            video_data[video_id] = video_item
except Exception as e:
    print('Error in get_video_data: {0}'.format(e))

Thanks!

Answer 1:

Note that your call to the Search.list endpoint runs against the entire set of YouTube videos published in that one-year period. The call specifies no other filtering criteria, so the query (upon pagination) could potentially return millions of video entries.
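A rough back-of-the-envelope calculation shows why an unfiltered search runs out of quota so quickly. The figures below are assumptions based on the YouTube Data API quota documentation (a default allocation of 10,000 units per day and a cost of 100 units per Search.list call); check the current values for your own project.

```python
# Assumed figures; verify them against your project's quota page.
DAILY_QUOTA_UNITS = 10_000   # assumed default daily allocation
SEARCH_LIST_COST = 100       # assumed cost of one search().list() call
MAX_RESULTS_PER_PAGE = 50    # largest maxResults that Search.list accepts

# Number of result pages that fit in one day's quota.
pages_per_day = DAILY_QUOTA_UNITS // SEARCH_LIST_COST
# Upper bound on the number of search results retrievable per day.
results_per_day = pages_per_day * MAX_RESULTS_PER_PAGE

print(pages_per_day, results_per_day)
```

Under these assumptions, a query that matches far more videos than that upper bound cannot be paginated to the end within a single day.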

If in fact you're looking for your own videos, then your Search.list endpoint call should include either the forMine or the channelId request parameter:

When you've constructed your youtube object with the discovery.build method using its credentials parameter (that is, you're issuing an authorized request), use the request parameter forMine as shown below:

request = youtube.search().list(
    forMine = True,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)

Note that this alternative turned out not to be viable when combined with publishedAfter or publishedBefore, as per the findings documented below under the section Update and fixes.

When you've constructed your youtube object with the discovery.build method using its developerKey parameter (that is, you're not issuing an authorized request), use the request parameter channelId as shown below:

request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)

Note that CHANNEL_ID is the ID of your channel (or any other channel for that matter).

The difference between the two kinds of API calls above is the following: when issuing an authorized request (the first case above), you'll get all videos of your channel, including those that are non-public (i.e. those that have their privacyStatus set to private or unlisted). When using an API key (the second case above), you'll get only the public videos (i.e. those that have their privacyStatus set to public), even if CHANNEL_ID is the ID of your own channel.

Now, unfortunately, your code above has another issue: your two Search.list endpoint calls are not identical, modulo the pageToken request parameter. That's because the second call does not get the request parameters publishedAfter and publishedBefore.

This difference means that you're not correctly paginating the result set of your first API call, even though you pass the pageToken parameter to the second API call.
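If you do paginate by hand, one way to keep every call identical apart from pageToken is to hold the request parameters in a single dictionary. The sketch below is illustrative only: fetch_all_pages and its list_call argument are hypothetical names, and list_call stands in for whatever issues one Search.list request and returns the decoded response. The sketch reads nextPageToken with dict.get, because the last page carries no such key.

```python
def fetch_all_pages(list_call, params):
    """Collect items from every page, reusing the same params each time.

    list_call is a hypothetical callable that accepts keyword arguments
    and returns one decoded response page (a dict).
    """
    items = []
    page_token = None
    while True:
        # Copy the shared parameters so every call uses the same filters.
        call_params = dict(params)
        if page_token is not None:
            call_params['pageToken'] = page_token
        response = list_call(**call_params)
        items.extend(response.get('items', []))
        # The last page has no nextPageToken key, so .get() returns None.
        page_token = response.get('nextPageToken')
        if page_token is None:
            break
    return items
```

With the real client, list_call could be something like lambda **kw: youtube.search().list(**kw).execute(), and params would hold channelId, publishedAfter, publishedBefore and the other arguments of the first call.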

Fortunately, the Google API Client Library for Python that you're using implements result set pagination in a simple, Pythonic way (I'll illustrate it for the second case above):

request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)
video_data = {}

while request:
    response = request.execute()

    for item in response['items']:
        video_id = item['id']['videoId']
        video_item = item['snippet']
        video_data[video_id] = video_item

    request = youtube.search().list_next(
        request, response)

The code above shows that it is not necessary to repeat the first API call in its entirety with an added pageToken parameter; it suffices to have the simpler statement:

    request = youtube.search().list_next(
        request, response)

This statement uses the value of the nextPageToken property of the response object to construct, from the old request object, a new one that has its pageToken property set properly.
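The same loop can be wrapped in a small generator so that calling code never touches page tokens. This is only a sketch: iter_search_items is a hypothetical helper name, and it relies on list_next returning None once there are no more pages.

```python
def iter_search_items(search_resource, request):
    """Yield every item from every page of a Search.list result set.

    search_resource is the object returned by youtube.search() and
    request is the object built by its list() method.
    """
    while request is not None:
        response = request.execute()
        for item in response.get('items', []):
            yield item
        # list_next() builds the request for the next page, or returns
        # None when the response carries no nextPageToken.
        request = search_resource.list_next(request, response)
```

It could then be used as video_data = {item['id']['videoId']: item['snippet'] for item in iter_search_items(youtube.search(), request)}.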

Update and fixes

Upon further tests and investigations with respect to the invocation of Search.list with the request parameters forMine, publishedAfter and publishedBefore as above, I came to the following conclusion:

the parameter forMine=True given without either of the parameters publishedAfter and publishedBefore makes the API call work as expected;
the parameter forMine=True given along with either or both of the parameters publishedAfter and publishedBefore produces the HTTP error 400 Bad Request along with the JSON error response:

{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "errors": [
      {
        "message": "Request contains an invalid argument.",
        "domain": "global",
        "reason": "badRequest"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}

Google's own issue tracker records a very recent bug report that describes precisely the behavior above. The official response from Google's staff was the following:

Status: Won't Fix (Intended Behavior)

This is working as intended. Basically you can only set one of the resource filters if it's a for_content_owner request, but both channel ID and published after are resource filters. This requirement doesn't seem to be specified on the developer website: https://developers.google.com/youtube/v3/docs/search/list.
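Given that restriction, one possible workaround (a suggestion, not something tested as part of the findings above) is to call Search.list with forMine=True and no date parameters, then filter the returned items by snippet.publishedAt on the client. A minimal sketch, assuming publishedAt is an ISO 8601 UTC timestamp such as 2019-06-01T12:00:00Z; the function and variable names are hypothetical.

```python
from datetime import datetime, timezone

def published_in_range(items, start, end):
    """Keep items whose snippet.publishedAt satisfies start <= t < end."""
    kept = []
    for item in items:
        raw = item['snippet']['publishedAt']
        # Older Python versions of fromisoformat() reject a trailing 'Z',
        # so rewrite it as an explicit UTC offset first.
        published = datetime.fromisoformat(raw.replace('Z', '+00:00'))
        if start <= published < end:
            kept.append(item)
    return kept

# Half-open interval covering the whole of 2019 in UTC.
start_2019 = datetime(2019, 1, 1, tzinfo=timezone.utc)
start_2020 = datetime(2020, 1, 1, tzinfo=timezone.utc)
```

Calling published_in_range(all_items, start_2019, start_2020) on the items collected from the unfiltered forMine query would then leave only the 2019 videos.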

Source: https://stackoverflow.com/questions/65443193/youtube-data-api-page-token-question-python

Tags

python

youtube-data-api