Trying to get my head around Scrapy but hitting a few dead ends.
I have a 2 Tables on a page and would like to extract the data from each one then move along to the
You need to slightly correct your code. Since you already select all elements within the table you don't need to point again to a table. Thus you can shorten your xpath to something like thistd[1]//text()
.
def parse_products(self, response):
products = response.xpath('//*[@id="Year1"]/table//tr')
# ignore the table header row
for product in products[1:]
item = Schooldates1Item()
item['hol'] = product.xpath('td[1]//text()').extract_first()
item['first'] = product.xpath('td[2]//text()').extract_first()
item['last'] = product.xpath('td[3]//text()').extract_first()
yield item
Edited my answer since @stutray provide the link to a site.
I got it working with these xpaths for the html source you've provided:
products = sel.xpath('//*[@id="Y1"]/table//tr')
for p in products[1:]:
item = dict()
item['hol'] = p.xpath('td[1]/text()').extract_first()
item['first'] = p.xpath('td[1]/text()').extract_first()
item['last'] = p.xpath('td[1]/text()').extract_first()
yield item
Above assumes that each table row contains 1 item.
You can use CSS Selectors instead of xPaths, I always find CSS Selectors easy.
def parse_products(self, response):
for table in response.css("#Y1 table")[1:]:
item = Schooldates1Item()
item['hol'] = product.css('td:nth-child(1)::text').extract_first()
item['first'] = product.css('td:nth-child(2)::text').extract_first()
item['last'] = product.css('td:nth-child(3)::text').extract_first()
yield item
Also do not use tbody
tag in selectors. Source:
Firefox, in particular, is known for adding elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use in your XPath expressions.