Question
I have been desperately seeking a way to crawl all comments and their corresponding replies for my research, and I am having a very hard time creating a data frame that keeps the comment data in the correct, corresponding order.
I am going to share my code here so you professionals can take a look and give me some insights.
import pandas as pd

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment': [comment], 'Date': [comment2], 'ID': [comment3], 'Likes': [comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a', header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments
When I run this, my crawler collects comments but misses some of the replies under certain comments.
How can I make it collect the comments and their corresponding replies and put them into a single data frame?
Update
So, somehow I managed to pull the information I wanted in the output section of my Jupyter Notebook. All I have to do now is append the results to the data frame.
Here is my updated code:
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments
In the printed result, each group delimited by a line of ============================== is a top-level comment followed by its corresponding replies.
What would be a good way to append these results to a data frame?
Answer 1:
According to the official documentation, the replies.comments[] property of the CommentThreads resource has the following specification:
replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource. The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.
Consequently, if you want to obtain all replies associated with a given top-level comment, you will have to query the Comments.list API endpoint appropriately.
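For illustration, here is a minimal sketch of such a query; the helper name get_all_replies is hypothetical, and service is assumed to be an authorized google-api-python-client object. It pages through the replies manually via the pageToken parameter discussed below:

# Hypothetical helper sketching the Comments.list query described above
def get_all_replies(service, parent_comment_id):
    replies = []
    params = {'part': 'id,snippet', 'parentId': parent_comment_id, 'maxResults': 100}

    while True:
        # Comments.list with parentId returns the replies to one top-level comment
        response = service.comments().list(**params).execute()
        replies.extend(response['items'])

        # nextPageToken disappears from the response on the last page
        if 'nextPageToken' in response:
            params['pageToken'] = response['nextPageToken']
        else:
            return replies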
I recommend reading my answer to a closely related question; it has three sections:
- Top-Level Comments and Associated Replies,
- The property nextPageToken and the parameter pageToken, and
- API Limitations Imposed by Design.
From the get-go, you'll have to acknowledge that the API (as currently implemented) does not allow you to obtain all top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.
As for a Python implementation, I would suggest structuring the code as follows:
def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId=video_id,
        part='id,snippet,replies',
        maxResults=100
    )

    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet']['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
                    reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' property set to an array of 'Comments Resources'.
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            ...

        # list_next builds the request for the next page of results,
        # or returns None when no further pages remain
        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId=comment_id,
        part='id,snippet',
        maxResults=100
    )

    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies
Note that the ellipsis above (...) would have to be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.
The simplest way (useful for quick testing) would be to replace ... with comments.append(comment) and then have the caller of get_video_comments simply pretty-print (using json.dump) the object obtained from that function.
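For instance, assuming the ... placeholder was replaced with comments.append(comment), a quick-test caller might look like the following sketch (the API key and video ID are placeholders, and json.dumps is used here as the string-producing variant of json.dump):

import json
from googleapiclient.discovery import build

# Placeholders -- substitute your own API key and target video ID
service = build('youtube', 'v3', developerKey='YOUR_API_KEY')
threads = get_video_comments(service, 'YOUR_VIDEO_ID')

# Quick test: pretty-print the raw CommentThreads structures
print(json.dumps(threads, indent=4))

And, coming back to the asker's goal of a single data frame, one possible approach (a sketch, assuming pandas; the column names are illustrative) is to flatten each thread into one row per comment or reply:

import pandas as pd

rows = []
for thread in threads:
    top = thread['snippet']['topLevelComment']['snippet']
    rows.append({
        'Comment': top['textDisplay'],
        'Date': top['publishedAt'],
        'ID': top['authorDisplayName'],
        'Likes': top['likeCount'],
        'Parent': None  # top-level comments have no parent
    })
    for reply in thread.get('replies', {}).get('comments', []):
        r = reply['snippet']
        rows.append({
            'Comment': r['textDisplay'],
            'Date': r['publishedAt'],
            'ID': r['authorDisplayName'],
            'Likes': r['likeCount'],
            'Parent': r['parentId']  # links the reply to its top-level comment
        })

df = pd.DataFrame(rows)
df.to_csv('youtube.csv', index=False)

The Parent column keeps each reply attached to its top-level comment, so the ordering of comments and corresponding replies survives in a single frame.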
Source: https://stackoverflow.com/questions/64275331/youtube-data-api-to-crawl-all-comments-and-replies