I am scraping a website using Scrapy that requires cookies and JavaScript to be enabled. I don't think I will have to actually process JavaScript. All I need is to pretend as
You should try the Splash JS rendering engine together with scrapyjs (since renamed scrapy-splash). Here is an example of how to set it up in your project's settings.py:
SPLASH_URL = 'http://192.168.59.103:8050'  # address of your running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashMiddleware': 725,  # matches the scrapy_splash import used below
}
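Note that the scrapy-splash README recommends a few more settings than the minimal pair above, so that cookies survive the round trip through Splash and duplicate filtering understands Splash requests. The middleware priorities below follow that README; treat this as a sketch to adapt, and point SPLASH_URL at your own Splash instance:

```python
# settings.py -- fuller scrapy-splash configuration (priorities taken from
# the scrapy-splash README; adjust SPLASH_URL to your own instance).
SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,  # forwards cookies through Splash
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filtering and HTTP caching
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```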
Scrapinghub, the company behind Scrapy, has special instances to run your spiders with Splash enabled.
Then yield SplashRequest instead of Request in your spider, like this:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "myspider"  # every Scrapy spider needs a name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},  # give the page 0.5 s to render
            )

    def parse(self, response):
        # response.body is a result of the render.html call; it
        # contains HTML processed by a browser.
        # …
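To see what SplashRequest is doing for you: with endpoint='render.html' it boils down to an HTTP call against Splash's /render.html endpoint, with your args passed as request parameters (parameter names here are from the Splash HTTP API; the helper function is just an illustration). A stdlib-only sketch of the equivalent URL:

```python
from urllib.parse import urlencode

SPLASH_URL = "http://192.168.59.103:8050"  # your Splash instance

def splash_render_url(target_url, wait=0.5):
    """Build the Splash render.html URL roughly equivalent to
    SplashRequest(target_url, endpoint='render.html', args={'wait': wait})."""
    params = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{params}"

print(splash_render_url("http://example.com"))
# -> http://192.168.59.103:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Fetching that URL in a browser is also a handy way to check that your Splash instance is up and rendering the target page before wiring it into the spider.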