Question
I want to scrape news from this website:
https://www.bbc.com/news
You can see that the website has categories such as Home, US Election, Coronavirus, etc.
For example, if I go to a specific news article such as: https://www.bbc.com/news/election-us-2020-54912611
I can write a scraper that gives me the headline. This is the code:
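import requests
from bs4 import BeautifulSoup
# Placeholder value: the original snippet used `headers` without showing its definition.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# select() returns a list of matching <h1> tags inside <header>
title = soup.select("header h1")
print(title)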
There are hundreds of news articles on this website, so my question is: is there a way to access every news article on the website (all categories) starting from the home page URL? On the home page I can't see all the articles, only some of them, so is there a way for me to load the HTML for the whole website, so that I can easily get all the news headlines with:
soup.select("header h1")
Answer 1:
OK. After getting the headlines from a page, you will also find links to other pages on it. You open each of those links in turn and fetch the information from them. It can look like this:
visited = set()
headlines = []
links = [....]
while links:
    # take the next link first, then skip it if it was already fetched
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    visited.add(link_for_fetch)
    content = get_contents(link_for_fetch)
    headlines += parse_headlines(content)
    links += parse_links(content)
It's just pseudocode; you can write it in any programming language. But parsing the whole site this way can take a lot of time :( and the site may block your IP address if it detects you as a robot.
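The loop above could be fleshed out roughly as follows. This is only a sketch under a few assumptions: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, it collects every h1 rather than only "header h1", and the names PageParser, crawl, path_prefix and max_pages, as well as the example.test pages, are made up for illustration. The fetch function is passed in as a parameter, so the demo runs against in-memory pages instead of the real site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects the text of every <h1> and every <a href> on one page."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self.links = []
        self._in_h1 = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._in_h1:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            text = "".join(self._buf).strip()
            if text:
                self.headlines.append(text)


def crawl(start_url, fetch, path_prefix="/news", max_pages=50):
    """Breadth-first crawl from start_url; returns {url: [headlines]}.

    fetch(url) must return the page HTML as a string, or None on failure.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    results = {}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        if parser.headlines:
            results[url] = parser.headlines
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _ = urldefrag(urljoin(url, href))
            parts = urlparse(absolute)
            # Coarse filter: stay on the same host and under path_prefix.
            if parts.netloc == host and parts.path.startswith(path_prefix):
                if absolute not in visited:
                    queue.append(absolute)
    return results


# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "https://example.test/news":
        '<h1>Home</h1><a href="/news/a">A</a><a href="/sport">S</a>',
    "https://example.test/news/a":
        '<header><h1>Article A</h1></header>'
        '<a href="/news">back</a><a href="/news/b#top">B</a>',
    "https://example.test/news/b":
        '<header><h1>Article B</h1></header>',
}

demo = crawl("https://example.test/news", PAGES.get)
for page_url, found in demo.items():
    print(page_url, found)
```

For a real site, fetch could be a small wrapper that calls requests.get(url, headers=headers) and returns response.text, ideally with a time.sleep between requests and a robots.txt check (for example with urllib.robotparser). The max_pages cap is there because a crawl like this can otherwise run for a very long time.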
Source: https://stackoverflow.com/questions/64804060/accessing-all-elements-from-main-website-page-with-beautiful-soup