Can I extract comments of any page from https://www.rt.com/ using python3?

Posted by 不打扰是莪最后的温柔 on 2020-01-07 06:25:27

Question


I am writing a web crawler. I extracted the heading and main discussion from this link, but I cannot find any of the comments in the page source (Ctrl+U, then Ctrl+F for the comment text). I think the comments are loaded by JavaScript. Can I extract them?


Answer 1:


RT uses a service from spot.im for its comments.

You need to make two POST requests: the first to https://api.spot.im/me/network-token/spotim to get a token, then a second to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.

I wrote a quick script to do this:

import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C' # spotim id for RT
    post_id = re.search(r'[0-9]+', article_url).group(0)  # first run of digits in the URL is used as the post id

    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']

    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }

    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(
        r2_url,
        data=json.dumps(payload),
        headers={'X-Spotim-Token': spotim_token, 'Content-Type': 'application/json'},
    )

    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
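
The answer does not show the layout of the JSON that comes back, so pulling out the comment bodies takes a little exploration. As a rough sketch, one schema-agnostic approach is to walk the nested response and collect every string stored under a given key. The helper name `collect_values`, the key name "text", and the sample data below are all assumptions made for illustration, not the documented spot.im format.

```python
def collect_values(node, key="text"):
    """Recursively gather every string stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up data, only to show the traversal; the real response layout may differ.
sample = {"conversation": {"comments": [
    {"content": [{"type": "text", "text": "First comment"}]},
    {"content": [{"type": "text", "text": "Second comment"}],
     "replies": [{"content": [{"text": "A reply"}]}]},
]}}
print(collect_values(sample))
```

With the script above, this could be tried as `collect_values(get_rt_comments(url))`, adjusting the key once the real response has been inspected.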



Answer 2:


Yes, if it can be viewed with a web browser, you can extract it.

If you look at the source, the comment section is really an iframe that loads a piece of JavaScript. That script creates a new script tag in the document whose source is bundle.js, which contains the actual commenting software. This in turn fetches the comments.
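
To see which external resources a page pulls in, a small sketch like the following can list the `src` of every iframe and script tag in the static HTML. It uses only the standard library; the class and function names and the sample markup are hypothetical. Note that it only sees tags present in the downloaded HTML, so a script tag created later by JavaScript (as described above) would not show up.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Record the src attribute of every <iframe> and <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "script"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append((tag, src))


def list_embedded_sources(html):
    """Return (tag, src) pairs for iframes and external scripts in `html`."""
    parser = SourceCollector()
    parser.feed(html)
    return parser.sources


# Made-up markup; the URLs are placeholders, not the ones RT actually serves.
sample_html = """
<html><body>
  <iframe src="https://example.com/comments-frame"></iframe>
  <script src="https://example.com/bundle.js"></script>
  <script>console.log("inline script, no src");</script>
</body></html>
"""
print(list_embedded_sources(sample_html))
```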

Instead of going through this manually, you could use, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. You can then scrape the rendered page instead of having your crawler fetch the external resources itself.

Examples of such headless browsers are Spynner, dryscrape, and the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
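
As a rough illustration of the headless-browser approach, here is a minimal sketch using Selenium with headless Chrome. Selenium is a swapped-in choice, not one of the libraries named above, and it assumes Selenium and a Chrome driver are installed. The function name, the `driver_factory` parameter, and the fixed wait are assumptions for illustration. If the comments are rendered inside an iframe, the driver would likely also need to switch into that frame before reading the page source.

```python
import time


def fetch_rendered_html(url, driver_factory=None, wait_seconds=5):
    """Load `url` in a browser, wait for scripts to run, return the resulting HTML."""
    if driver_factory is None:
        # Imported lazily so the function can be exercised without Selenium installed.
        from selenium import webdriver

        def driver_factory():
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            return webdriver.Chrome(options=options)

    driver = driver_factory()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for the comment widget to load
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even if loading fails
```

Calling `fetch_rendered_html('https://www.rt.com/usa/353493-clinton-speech-affairs-silence/')` would then return the HTML after JavaScript has run, which can be parsed like any other page. The `driver_factory` hook exists only so that a different browser or a test double can be passed in.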



Source: https://stackoverflow.com/questions/38607502/can-i-extract-comments-of-any-page-from-https-www-rt-com-using-python3
