I'm trying to figure out how to model data in Riak. Let's say you are building something like a CMS with two features, news and products. You need to be able to store this information for multiple clients X and Y. How would you typically structure this?
One bucket per client and then two keys news and products. Store multiple objects under each key and then use map/reduce to order them.
Store both the news and the products in the same bucket, but with a new autogenerated key for each news item and product item. That is, one bucket for X and one for Y.
One bucket per client/feature combination, that is, the buckets would be X-news, X-products, Y-news and Y-products. Then use map/reduce on the whole bucket to return the results in order.
Which would be the best way to handle this problem?
I'd create 2 buckets: news and products. Then I'd prefix keys in each bucket with client names. I'd probably also include dates in news keys for easy date ranging.
news/acme_2011-02-23_01 news/acme_2011-02-23_02 news/bigcorp_2011-02-21_01
And optionally prefix product names with category names
products/acme_blacksmithing_anvil products/bigcorp_databases_oracle
Then in your map/reduce you could use key filtering:
// BigCorp News items { "inputs":{ "bucket":"news", "key_filters":[["starts_with", "bigcorp"]] } // ... rest of mapreduce job } // Acme Blacksmithing items { "inputs":{ "bucket":"products", "key_filters":[["starts_with", "acme_blacksmithing"]] } // ... rest of mapreduce job } // News for all clients from Feb 12th to 19th { "inputs":{ "bucket":"news", "key_filters":[["tokenize", "_", 2], ["between", "2011-02-12", "2011-02-19"]] } // ... rest of mapreduce job }
An even more efficient approach to this than using key filtering (as per Kev Burns's recommendation) is to use Secondary Indexes or Riak Search, to model this scenario.
Take a look at my answers to Which clustered NoSQL DB for a Message Storing purpose? and Links in Riak: what can they do/not do, compared to graph databases? for a discussion of similar cases.
You have several decisions to make, depending on your use case. In all cases, you would start out with a company bucket, so that each company has a unique key.
1) Whether to store the items of interest in 2 separate buckets (news and products) or in one (something like items_of_interest) depends on your preference and ease of querying. If you're always going to be querying for both news and products for a company in a single query, you might as well store them in a single bucket. But I recommend using 2 separate ones, to keep easier track of them, especially if you'll have something like separate tabs or pages for "Company X - Products" and "Company X - News". And if you need to combine them into a single feed, you would make 2 queries (one for news and one for products), and combine them in the client code (by date or whatever).
2) If a news/product item can have one and only one company that it belongs to, create a secondary index on company_key for each item. That way, you can easily fetch all news or products for a company via a secondary index (2i) query for that company.
3) If there's a many-to-many relationship (if a news/product item can belong to several companies (perhaps the news item is about a joint venture for 2 separate companies)), then I recommend modeling the relationship as a separate Riak object. For example, you could create a mentions bucket, and for each company mentioned in a news story, you would insert a Mention object, with its own unique key, a secondary index for company_key, and the value would contain a type ('news' or 'product') and an item_key (news key or product key). Extracting relationships to separate Riak objects like this allows you to do a lot of interesting things -- tag them arbitrarily using Riak Search, query them for subscription event notifications, etc.