Question
Help! I'm reading the following Scrapy code and the crawler's output. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of the yields: I expect all of the parse_member requests in the loop to be processed first and only then have the group_item returned, but it seems the yielded item is always emitted before the yielded requests are handled.
start_urls = [
    "http://china.fathom.info/data/data.json"
]

def parse(self, response):
    groups = json.loads(response.body)['group_members']
    for i in groups:
        group_item = GroupItem()
        group_item['name'] = groups[i]['name']
        group_item['chinese'] = groups[i]['chinese']
        group_item['members'] = []
        members = groups[i]['members']
        for member in members:
            yield Request(self.person_url % member['id'],
                          meta={'group_item': group_item, 'member': member},
                          callback=self.parse_member, priority=100)
        yield group_item

def parse_member(self, response):
    group_item = response.meta['group_item']
    member = response.meta['member']
    person = json.loads(response.body)
    ego = person['ego']
    group_item['members'].append({
        'id': ego['id'],
        'name': ego['name'],
        'chinese': ego['chinese'],
        'role': member['role']
    })
(Screenshot: the resulting data in MongoDB)
Answer 1:
You need to yield the item from the final callback. parse doesn't stop and wait for parse_member to finish, so the group_item yielded in parse hasn't been filled in yet while parse_member is still working.

Don't yield the group_item in parse; yield it only from parse_member, since you already pass the item along in meta and recover it in parse_member with response.meta['group_item'].
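Concretely, the fix can look like the following minimal sketch. It reuses the same GroupItem, person_url, and field names as in the question; the 'expected' count stored in meta is a hypothetical addition so parse_member can tell when the last member response for a group has been merged and the item is ready to be yielded.

def parse(self, response):
    groups = json.loads(response.body)['group_members']
    for i in groups:
        group_item = GroupItem()
        group_item['name'] = groups[i]['name']
        group_item['chinese'] = groups[i]['chinese']
        group_item['members'] = []
        members = groups[i]['members']
        for member in members:
            # Pass the shared item and the total member count along with each
            # request; note that group_item is NOT yielded here any more.
            yield Request(self.person_url % member['id'],
                          meta={'group_item': group_item,
                                'member': member,
                                'expected': len(members)},
                          callback=self.parse_member)

def parse_member(self, response):
    group_item = response.meta['group_item']
    member = response.meta['member']
    person = json.loads(response.body)
    ego = person['ego']
    group_item['members'].append({
        'id': ego['id'],
        'name': ego['name'],
        'chinese': ego['chinese'],
        'role': member['role']
    })
    # Yield the shared item only once every member response has been merged in.
    if len(group_item['members']) == response.meta['expected']:
        yield group_item

This keeps one item per group and emits it only when it is complete. If the same member URL can appear in several groups, adding dont_filter=True to the Request keeps Scrapy's duplicate filter from dropping the repeated request and leaving the count short.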
Source: https://stackoverflow.com/questions/33875339/how-to-control-the-order-of-yield-in-scrapy