I have recently been learning Python and am dipping my hand into building a web-scraper. It\'s nothing fancy at all; its only purpose is to get the data off of a betting we
Here is a simple example of scrapy
with an AJAX request. Let see the site rubin-kazan.ru.
All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):
When I analyze the source code of the page I can't see all these messages because the web page uses AJAX technology. But I can with Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP request that generate the messages on the web page:
It doesn't reload the whole page but only the parts of the page that contain messages. For this purpose I click an arbitrary number of page on the bottom:
And I observe the HTTP request that is responsible for message body:
After finish, I analyze the headers of the request (I must quote that this URL I'll extract from source page from var section, see the code below):
And the form data content of the request (the HTTP method is "Post"):
And the content of response, which is a JSON file:
Which presents all the information I'm looking for.
From now, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:
class spider(BaseSpider):
name = 'RubiGuesst'
start_urls = ['http://www.rubin-kazan.ru/guestbook.html']
def parse(self, response):
url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
formdata={'page': str(page + 1), 'uid': ''})
def RubiGuessItem(self, response):
json_file = response.body
In parse
function I have the response for first request.
In RubiGuessItem
I have the JSON file with all information.