How to scrape all the home page text content of a website?

怎甘沉沦 提交于 2020-04-17 19:06:53

问题


So I am new to webscraping, I want to scrape all the text content of only the home page.

this is my code, but it now working correctly.

from bs4 import BeautifulSoup
import requests


website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")

full_text = soup.find_all()

print(full_text)

When I print "full_text" it give me a lot of html content but not all, when I ctrl + f " traiteurcheminfaisant@hotmail.com" the email adress that is on the home page (footer) is not found on full_text.

Thanks you for helping!


回答1:


A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.

If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):

from bs4 import BeautifulSoup
from selenium import webdriver

import time

driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)

source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')

full_text = soup.find_all()

print(full_text)



回答2:


I haven't used BeatifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can use to find the email.

from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding = "UTF8", errors='ignore')
    print(html.find("traiteurcheminfaisant@hotmail.com"))
except:
    print("Cannot open webpage")




来源:https://stackoverflow.com/questions/60471420/how-to-scrape-all-the-home-page-text-content-of-a-website

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!