Question
I'm trying to scrape video URLs and titles from YouTube accounts whose pages follow the pattern https://www.youtube.com/c/%s/videos % accountName, for example Apple.
The selector for the clickable text (title) on YouTube is ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer. I found it by clicking on the title element in inspector mode (Firefox).
I am not getting any results, but the URL 'url' (somewhere in webCommandMetadata) and the title 'simpleText' do show up in req.content.
Example:
import requests
from bs4 import BeautifulSoup

account = "Apple"  # example account name from above
url = "https://www.youtube.com/c/%s/videos" % account
req = requests.get(url, timeout=30)
soup = BeautifulSoup(req.content, 'html.parser')
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-lockup-title a')[:6]
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-simple-endpoint a')[:18]
latest_videos_html = soup.select('ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer')[:18]
print(latest_videos_html)
My question is: how do I know what to pass to soup.select, and how do I debug this so I can fix it myself in the future?
Thanks for your support!
Answer 1:
The content you see in the browser is mostly loaded by JavaScript. A plain GET request does not run those scripts, so you do not receive the dynamic content of the page.
Looking at users' pages on YouTube, I can see that you do not get much proper HTML; instead, you get JSON inside the <body> tag.
To answer your question: in the future, when you want to scrape something from a website, first make sure you actually have the content when using requests.get rather than assuming that you get the same content a browser gets.
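One way to do that check is to count how often the strings you are looking for appear in the raw response text. The sketch below is only an illustration: report_markers is a made-up helper name, and sample_html is a tiny stand-in for req.text so that the snippet runs without a network connection.

```python
def report_markers(html, markers):
    """Count how often each marker string occurs in the raw response text."""
    return {marker: html.count(marker) for marker in markers}


# Stand-in for req.text (hypothetical): the data sits inside a script,
# not in rendered tags such as ytd-grid-video-renderer.
sample_html = '<body><script>var d = {"title": {"simpleText": "Demo"}};</script></body>'

report = report_markers(sample_html, ["ytd-grid-video-renderer", "simpleText"])
print(report)
```

With the code from the question you would call it as report_markers(req.text, [...]). If the element name from the inspector has a count of zero while a key such as simpleText does appear, the selector cannot match, and the data has to be taken from the embedded JSON instead.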
Now, specifically for the YouTube problem: if you save req.text to a file, open it in a text editor, and look inside the <body> tag, you will see that in the second <script> tag the variable window["ytInitialData"] is set to a very long JSON object.
That JSON contains all the available info you need for every video (title, duration, video ID, etc.). I suggest you parse it and see if that solves your problem.
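A rough sketch of that parsing step, using only the standard library, could look like the following. Several things here are assumptions rather than facts about YouTube's current markup: that the JSON object starts at the first "{" after the text ytInitialData, that each video entry is a dict with a "videoId" key and a "title" dict holding "simpleText", and the helper names extract_initial_data and iter_videos. The sample_html string is invented so the example runs offline.

```python
import json


def extract_initial_data(html, marker="ytInitialData"):
    """Parse the first JSON object that follows `marker` in the HTML text."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever comes after it.
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


def iter_videos(node):
    """Yield (title, url) for every dict with the assumed video keys."""
    if isinstance(node, dict):
        title = node.get("title")
        if "videoId" in node and isinstance(title, dict) and "simpleText" in title:
            yield title["simpleText"], "https://www.youtube.com/watch?v=" + node["videoId"]
        for value in node.values():
            yield from iter_videos(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_videos(item)


# Invented stand-in for req.text; the real page is far larger.
sample_html = (
    '<script>window["ytInitialData"] = {"items": [{"gridVideoRenderer": '
    '{"videoId": "abc123", "title": {"simpleText": "Demo video"}}}]};</script>'
)

videos = list(iter_videos(extract_initial_data(sample_html)))
print(videos)
```

Walking the whole structure recursively avoids hard-coding the exact nesting path, which is long and may change. If the real page uses different key names, open the saved file, search for a known video title, and adjust the keys accordingly.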
Source: https://stackoverflow.com/questions/63199271/beautifulsoup-python-youtube-scrape-not-working