I want to scrape a page of data (using the Python Scrapy library) without having to define each individual field on the page. Instead I want to dynamically generate fields using
This solution works with the exporters (scrapy crawl -t json -o output.json):
import scrapy

class FlexibleItem(scrapy.Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        super(FlexibleItem, self).__setitem__(key, value)
EDIT: updated to work with latest Scrapy
This works with version 0.24 and also allows Items to work with Item Loaders:
import scrapy
from collections import defaultdict

class FlexibleItem(scrapy.Item):
    fields = defaultdict(scrapy.Field)

    def __setitem__(self, key, value):
        # all keys are supported
        self._values[key] = value
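To see why the defaultdict version can skip the membership check, here is a small stdlib-only sketch. The `Field` class below is a hypothetical stand-in for `scrapy.Field`, not the real one; the point is only that looking up a missing key in a `defaultdict` creates and stores a default entry, while a plain `in` test does not.

```python
from collections import defaultdict


class Field(dict):
    """Hypothetical stand-in for scrapy.Field (stdlib only)."""


fields = defaultdict(Field)

# A membership test does not create an entry...
before = "price" in fields

# ...but a lookup does: the factory is called and the result is stored.
meta = fields["price"]
after = "price" in fields
```

So any code that reads `fields[key]` gets a field back for every key, which is presumably why the `__setitem__` above can write straight into `_values`.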
I know my answer is late, but for those who still need dynamic items in Scrapy (the current version is 1), I created a repository on GitHub that includes an example:
https://github.com/WilliamKinaan/ScrapyDynamicItems
The old method didn't work with item loaders and was complicating things unnecessarily. Here's a better way of achieving a flexible item:
from scrapy.item import BaseItem
# On Scrapy 1.0 and later, import ItemLoader from scrapy.loader instead
from scrapy.contrib.loader import ItemLoader

class FlexibleItem(dict, BaseItem):
    pass

if __name__ == '__main__':
    item = FlexibleItem()
    loader = ItemLoader(item)
    loader.add_value('foo', 'bar')
    loader.add_value('baz', 123)
    loader.add_value('baz', 'test')
    # With None as the field name, the dict's keys are used as field names
    loader.add_value(None, {'abc': 'xyz', 'foo': 555})
    print(loader.load_item())
    if 'meow' not in item:
        print("it's not a cat!")
Result:
{'foo': ['bar', 555], 'baz': [123, 'test'], 'abc': ['xyz']}
it's not a cat!
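The values come back as lists because the loader accumulates every value added for a field and no output processor is configured to collapse them. A rough stdlib-only model of that accumulation is sketched below; `ToyLoader` is a made-up name and a deliberate simplification, not Scrapy's ItemLoader.

```python
class ToyLoader(object):
    """Hypothetical, simplified model of how values pile up per field."""

    def __init__(self, item):
        self.item = item

    def add_value(self, field_name, value):
        if field_name is None:
            # A dict passed with no field name supplies several fields at once.
            for name, val in value.items():
                self.add_value(name, val)
            return
        # Every field holds a list; repeated calls append to it.
        self.item.setdefault(field_name, []).append(value)

    def load_item(self):
        return self.item


loader = ToyLoader({})
loader.add_value('foo', 'bar')
loader.add_value('baz', 123)
loader.add_value('baz', 'test')
loader.add_value(None, {'abc': 'xyz', 'foo': 555})
result = loader.load_item()
```

In this model `result` holds the same lists as the output above. To get single values out of a real loader, you would configure an output processor on it.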
Okay, I've found a solution. It's a bit of a hack, but it works.
A Scrapy Item stores its field names in a dict called fields. When you add data to an Item, it checks whether the field exists and raises a KeyError if it doesn't:
def __setitem__(self, key, value):
    if key in self.fields:
        self._values[key] = value
    else:
        raise KeyError("%s does not support field: %s" %\
            (self.__class__.__name__, key))
What you can do is override this __setitem__ method to be less strict:
from scrapy.item import Item, Field

class FlexItem(Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            # Register the unknown field on first use instead of raising
            self.fields[key] = Field()
        self._values[key] = value
And there you go.
Now when you add data to an Item, if the item doesn't have that field defined, it will be added, and then the data will be added as normal.
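One thing to keep in mind: fields is a class-level dict, so a field registered through one item becomes a known field for every other instance of the same class. The sketch below models this with plain Python classes so it runs without Scrapy. `Field`, `StrictItem`, and `FlexItem` here are stand-ins written for illustration, not Scrapy's actual implementation.

```python
class Field(dict):
    """Hypothetical stand-in for scrapy.Field (stdlib only)."""


class StrictItem(object):
    """Toy model of an item that rejects undeclared fields."""
    fields = {}  # class-level: shared by every instance of this class

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key in self.fields:
            self._values[key] = value
        else:
            raise KeyError("%s does not support field: %s"
                           % (self.__class__.__name__, key))

    def __getitem__(self, key):
        return self._values[key]


class FlexItem(StrictItem):
    fields = {}  # separate registry so the parent class is left untouched

    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()  # register the field on first use
        self._values[key] = value


# The strict item refuses an undeclared field.
strict = StrictItem()
try:
    strict["title"] = "x"
    strict_error = None
except KeyError as exc:
    strict_error = exc

# The flexible item accepts it, and the field is now known class-wide.
first = FlexItem()
first["title"] = "Example"
second = FlexItem()
```

After this runs, `second.fields` contains "title" even though nothing was ever assigned to `second`. The values themselves stay per instance. That sharing is usually harmless for exporting, but it does mean the set of known fields grows over the life of the crawl.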