Accessing all elements from main website page with Beautiful Soup

让人想犯罪 __ submitted on 2021-02-11 12:49:47

Question


I want to scrape news from this website:

https://www.bbc.com/news

You can see that the website has categories such as Home, US Election, Coronavirus, etc.

For example, if I go to a specific news article such as https://www.bbc.com/news/election-us-2020-54912611

I can write a scraper that gives me the headline. This is the code:

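import requests
from bs4 import BeautifulSoup

# Placeholder value: the original snippet uses headers without defining it.
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# select() returns a list of every tag matching the CSS selector.
title = soup.select("header h1")
print(title)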

The website has hundreds of news articles, so my question is: is there a way to access every news article on the website (all categories) starting from the home page URL? The home page shows only some of the articles, so is there a way to load the HTML for the whole website so that I can easily get all news headlines with:

soup.select("header h1")

Answer 1:


OK. After getting the headline from a page, you will also find further links on that page. Open those links in turn and fetch the information from each of them. It can look like this:

visited = set()
headlines = []
links = [....]
while links:
    # Take the next link first, then skip it if it was already fetched.
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    visited.add(link_for_fetch)
    content = get_contents(link_for_fetch)
    headlines += parse_headlines(content)
    links += parse_links(content)

This is just pseudocode; you can write it in any programming language. Be aware that crawling a whole site this way can take a long time :( and the site's bot protection can block your IP address.
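
As a rough illustration, the loop above could be sketched in Python along the following lines. This is only a sketch under several assumptions: the names fetch_html, parse_page and crawl, the "/news" path prefix, the page limit, the delay and the User-Agent value are all made up for this example, and the "header h1" selector is simply the one from the question, which may not match every page on the site.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def fetch_html(url):
    """Download one page. Needs the third-party requests package."""
    import requests  # imported lazily so crawl() can run with a stub fetcher

    # Assumed header value; the question does not show its headers dict.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    return response.text


def parse_page(html, base_url):
    """Return (headlines, absolute links) found on one page."""
    from bs4 import BeautifulSoup  # imported lazily, same reason as above

    soup = BeautifulSoup(html, "html.parser")
    headlines = [h1.get_text(strip=True) for h1 in soup.select("header h1")]
    links = [urljoin(base_url, a["href"]) for a in soup.select("a[href]")]
    return headlines, links


def crawl(start_url, fetch=fetch_html, parse=parse_page,
          prefix="/news", max_pages=50, delay=1.0):
    """Breadth-first crawl limited to one host, one path prefix and max_pages."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    headlines = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        found, links = parse(fetch(url), url)
        headlines.extend(found)
        for link in links:
            link = urldefrag(link).url  # drop "#fragment" so pages are not revisited
            parts = urlparse(link)
            if parts.netloc == host and parts.path.startswith(prefix) and link not in visited:
                queue.append(link)
        if delay:
            time.sleep(delay)  # pause between requests to reduce the risk of being blocked
    return headlines
```

Calling crawl("https://www.bbc.com/news") would then return the headlines it finds on up to max_pages pages. The page limit and the delay are there because of the two risks mentioned above: an unbounded crawl takes a long time, and rapid requests are more likely to get an IP address blocked.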



Source: https://stackoverflow.com/questions/64804060/accessing-all-elements-from-main-website-page-with-beautiful-soup
