Question
I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags have a blank "src" attribute.
SRC:
from bs4 import BeautifulSoup
import cfscrape
import requests

# create_scraper() returns a requests-compatible session (the cfscrape attempt)
scraper = cfscrape.create_scraper()
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
response = requests.get(url)  # the cfscrape attempt would call scraper.get(url) instead
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
for img in divImage.findAll('img'):
    print(img)
response.close()
I suspect the image scraping is blocked because the website uses Cloudflare. On that assumption, I also tried the "cfscrape" library to scrape the content.
Answer 1:
You need to wait for JavaScript to inject the HTML code for the images.
Several tools can do this; here are some of them:
- Ghost
- PhantomJS (Ghost Driver)
- Selenium
I was able to get it working with Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)
try:
    driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
except TimeoutException:
    # never ignore exceptions silently in real world code
    pass

# parse the DOM as rendered by the browser, after JavaScript has run
soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
# close the browser
driver.close()
for img in divImage.findAll('img'):
    print(img.get('src'))
Refer to "How to download image using requests" if you also want to download these images.
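For the download step, something along these lines might work. It is only a sketch: `filename_from_url` and `download_image` are names made up for illustration, and it assumes each `src` value collected above is an absolute URL that can be fetched directly with Requests.

```python
import os
from urllib.parse import urlparse

import requests


def filename_from_url(url, default="image.jpg"):
    """Derive a local file name from the path part of the URL (hypothetical helper)."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download_image(url, directory="."):
    """Stream one image to disk and return the path it was written to."""
    response = requests.get(url, stream=True, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    path = os.path.join(directory, filename_from_url(url))
    with open(path, "wb") as handle:
        # write in chunks so large images are not held in memory all at once
        for chunk in response.iter_content(chunk_size=8192):
            handle.write(chunk)
    return path
```

You would call `download_image(img.get('src'))` inside the loop instead of printing. Streaming in chunks is a choice, not a requirement; for small files, writing `response.content` in one go would also do.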
Answer 2:
Have you tried setting a custom user-agent? It's typically considered unethical to do so, but so is scraping manga.
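If you want to try that, a custom user-agent can be set on a Requests session roughly like this. Treat it as a sketch: the `USER_AGENT` string and the `make_session` helper are placeholders invented for the example, and a different header on its own does not make Requests execute JavaScript, so the "src" values may still come back blank.

```python
import requests

# Placeholder browser-like User-Agent string (an assumption, not a required value)
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def make_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


# Usage (performs a network request, so it is left commented out here):
# session = make_session()
# response = session.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206", timeout=10)
# print(response.status_code)
```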
Source: https://stackoverflow.com/questions/31419641/python-scraper-unable-to-scrape-img-src