BeautifulSoup Python YouTube scrape not working

Posted by 空扰寡人 on 2020-12-13 04:04:11

Question


I'm trying to scrape video URLs and titles from YouTube channel pages, whose addresses have the form "https://www.youtube.com/c/%s/videos" % accountName, for example with the account name Apple.

When I click a video title in Firefox's inspector, the clickable title text has the selector ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer.

soup.select returns no results with that selector, even though the URL (the 'url' key, somewhere inside webCommandMetadata) and the title (the 'simpleText' key) both appear in req.content.

Example:

import requests
from bs4 import BeautifulSoup

account = "Apple"  # example account name from the question
url = "https://www.youtube.com/c/%s/videos" % account
req = requests.get(url, timeout=30)
soup = BeautifulSoup(req.content, 'html.parser')
# Earlier selector attempts, kept for reference:
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-lockup-title a')[:6]
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-simple-endpoint a')[:18]
latest_videos_html = soup.select('ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer')[:18]

print(latest_videos_html)  # prints an empty list

My question is: how do I know what to pass to soup.select, and how do I debug this so that I can fix it myself in the future?

Thanks for your support!


Answer 1:


The content you see in the browser is loaded mostly by JavaScript. A plain GET request does not run that JavaScript, so the response does not include the dynamically rendered content.

Looking at YouTube user pages, the response contains very little conventional HTML markup; most of the data arrives as JSON embedded in the body tag.

To answer your question: in the future, when you want to scrape something from a website, first check that the response from requests.get actually contains the content, rather than assuming that you receive the same content a browser displays.
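One way to run that check is to search the raw response text for the strings you expect before writing any selector. The following is only a minimal sketch: report_markers is a hypothetical helper name, and sample_html is a made-up stand-in for req.text so that the example runs without a network call.

```python
def report_markers(html, markers):
    """Return {marker: bool} telling whether each string occurs in the raw HTML."""
    return {marker: marker in html for marker in markers}


# Made-up stand-in for req.text (an assumption, not real YouTube output):
# the element seen in the inspector is absent, but the data is present
# as JSON inside a <script> tag.
sample_html = (
    '<html><body><script>window["ytInitialData"] = '
    '{"title": {"simpleText": "Some video"}};</script></body></html>'
)

result = report_markers(
    sample_html,
    ['id="video-title"', "simpleText", "ytInitialData"],
)
for marker, found in result.items():
    print(f"{marker!r}: {'found' if found else 'missing'}")
```

With a real page you would pass req.text instead of sample_html. If the attribute from the inspector is missing while the data keys are present, the data lives in a script rather than in HTML elements, and no CSS selector will reach it.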

Now, specifically for the YouTube problem: if you save req.text to a file, open it in a text editor and look inside the <body> tag, you will see that the second <script> tag sets the variable window["ytInitialData"] to a very long JSON object.

That JSON contains all the information you need for every video (title, duration, video ID, etc.). I suggest you parse it and see if that solves your problem.
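A rough sketch of that parsing step, using only the standard library, might look like this. It rests on several assumptions: the marker string is the one quoted above (the page you receive may use a different assignment), the key names videoId and simpleText are guesses based on the question and on the fields listed above, and extract_initial_data, find_videos and sample_html are hypothetical names and data introduced for illustration.

```python
import json

# Marker quoted in the answer; the actual page source may differ.
MARKER = 'window["ytInitialData"]'


def extract_initial_data(html, marker=MARKER):
    """Decode the JSON object assigned to `marker` in the page source."""
    start = html.index(marker)      # raises ValueError if the marker is absent
    brace = html.index("{", start)  # first "{" after the marker
    # raw_decode parses one JSON value and ignores the text that follows it
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


def find_videos(node):
    """Yield (title, url) for every dict that looks like a video entry."""
    if isinstance(node, dict):
        title = node.get("title")
        # Assumed shape: {"videoId": ..., "title": {"simpleText": ...}}
        if "videoId" in node and isinstance(title, dict) and "simpleText" in title:
            yield (
                title["simpleText"],
                "https://www.youtube.com/watch?v=" + node["videoId"],
            )
        for value in node.values():
            yield from find_videos(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_videos(item)


# Made-up stand-in for req.text so the sketch runs offline.
sample_html = (
    '<script>window["ytInitialData"] = {"items": [{"gridVideoRenderer": '
    '{"videoId": "abc123", "title": {"simpleText": "First video"}}}]};</script>'
)

videos = list(find_videos(extract_initial_data(sample_html)))
print(videos)
```

Walking the whole structure recursively avoids hard-coding the long path down to each video entry, which is likely to change more often than the leaf keys do. If the titles in the real response are stored under a different key, adjust the check in find_videos accordingly.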



Source: https://stackoverflow.com/questions/63199271/beautifulsoup-python-youtube-scrape-not-working
