Question
I've tried to scrape yt-formatted strings with BeautifulSoup, but it always gives me an error. Here is my code:
import requests
import bs4
from bs4 import BeautifulSoup
r = requests.get('https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA')
soup = bs4.BeautifulSoup(r.text, "html.parser")
def onoroff():
    onoroff = soup.find('yt-formatted-string',{'id','subscriber-count'}).text
    return onoroff
print("Subscribers: "+str(onoroff().strip()))
This is the error I get:
AttributeError: 'NoneType' object has no attribute 'text'
Is there another way to scrape yt-formatted-strings?
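Answer 1:
Most of YouTube's content is generated by JavaScript, which BeautifulSoup cannot execute, so the yt-formatted-string element is not in the HTML that requests receives and find() returns None. You may have better luck extracting the JSON objects embedded in the page source instead of the HTML elements, e.g.: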
import requests, json, re
h = {
    'Host': 'www.youtube.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,pt;q=0.7,en;q=0.3',
    'Referer': 'https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA',
}
u = "https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA"
html = requests.get(u, headers=h).text
# Get the JSON object that holds the page data from the source code and convert it into a Python dict
matches = re.findall(r'window\["ytInitialData"\] = (.*\}\]\}\}\});', html, re.IGNORECASE | re.DOTALL)
if matches:
    j = json.loads(matches[0])
    # Browse the JSON object to find the info you need: https://jsoneditoronline.org/#left=cloud.123ad9bb8bbe498c95f291c32962aad2
    # We are now ready to get the number of subscribers (among other info):
    subscribers = j['header']['c4TabbedHeaderRenderer']['subscriberCountText']['runs'][0]["text"]
    print(subscribers)
    # 110 subscribers
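The regex above depends on the exact closing brackets of the JSON blob, and the key lookup raises if any level is missing. As a rough sketch of a more defensive variant, the snippet below finds the same `window["ytInitialData"]` marker and lets `json.JSONDecoder.raw_decode` parse a single JSON value starting there. The helper names `extract_initial_data` and `subscriber_text` and the `sample_html` string are hypothetical, made up for illustration. The marker and key path are taken from the answer and may not match what YouTube serves today.

```python
import json
import re

# Marker taken from the answer above; whether current pages still use it is an assumption.
MARKER = re.compile(r'window\["ytInitialData"\]\s*=\s*')


def extract_initial_data(html):
    """Return the embedded ytInitialData dict, or None if it cannot be found or parsed."""
    m = MARKER.search(html)
    if m is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever text follows it.
        data, _ = json.JSONDecoder().raw_decode(html, m.end())
    except json.JSONDecodeError:
        return None
    return data


def subscriber_text(data):
    """Follow the key path used in the answer; return None if any step is missing."""
    try:
        return data["header"]["c4TabbedHeaderRenderer"]["subscriberCountText"]["runs"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return None


# Hypothetical hand-written sample standing in for a real channel page.
sample_html = (
    '<script>window["ytInitialData"] = '
    '{"header": {"c4TabbedHeaderRenderer": {"subscriberCountText": '
    '{"runs": [{"text": "110 subscribers"}]}}}};</script>'
)

print(subscriber_text(extract_initial_data(sample_html)))
```

With a real page, you would pass the `html` string from `requests.get(u, headers=h).text` to `extract_initial_data` and treat a `None` result as a sign that the page layout differs from what this sketch assumes.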
Source: https://stackoverflow.com/questions/61427391/scrape-yt-formatted-strings-with-beautiful-soup