I am using Ruby on Rails 3.2.2 and MySQL. I would like to know if it is \"advisable\" / \"desirable\" to store in a database table related to a class all records related to
If there really is the prospect of "a big database table containing billions billions billions of rows" then perhaps you should craft a solution for your specific needs around a (relatively) sparsely populated table.
Large database tables pose a significant performance challange in how quickly the system can locate the relevant row or rows. Indexes and primary keys are really needed here; however they add to the storage requirements and also require CPU cycles to be maintained as records are added, updated, and deleted. Evenso, heavy-duty database systems also have partitioning features (see http://en.wikipedia.org/wiki/Partition_(database) ) that address such row location performance issues.
A sparsely populated table can probably serve the purpose assuming some (computable or constant) default can be used whenever no rows are returned. Insert rows only wherever something other than the default is required. A sparsely populated table will require much less storage space and the system will be able to locate rows more quickly. (The use of user-defined functions or views may help keep the querying straightforward.)
If you really cannot make a sparsely populated table work for you, then you are quite stuck. Perhaps you can make that huge table into a collection of smaller tables, though I doubt that's of any help if your database system supports partitioning. Besides, a collection of smaller tables makes for messier querying.
So let's say you have millions or billions of Users who or may not have certain privileges regarding the millions or billions of Articles in your system. What, then, at the business level determines what a User is privileged to do with a given Article? Must the User be a (paying) subscriber? Or may he or she be a guest? Does the User apply (and pay) for a package of certain Articles? Might a User be accorded the privilege of editing certain Articles? And so on and so forth.
So let's say a certain User wants to do something with a certain Article. In the case of a sparsely populated table, a SELECT
on that grand table UsersArticles will either return 1 row or none. If it returns a row, then one immediately knows the ArticleUserAuthorization, and can proceed with the rest of the operation.
If no row, then maybe it's enough to say the User cannot do anything with this Article. Or maybe the User is a member of some UserGroup that is entitled to certain privileges to any Article that has some ArticleAttribute (which this Article has or has not). Or maybe the Article has a default ArticleUserAuthorization (stored in some other table) for any User that does not have such a record already in UsersArticles. Or whatever...
The point is that many situations have a structure and a regularity that can be used to help reduce the resources needed by a system. Human beings, for instance, can add two numbers with up to 6 digits each without consulting a table of over half a trillion entries; that's taking advantage of structure. As for regularity, most folks have heard of the Pareto principle (the "80-20" rule - see http://en.wikipedia.org/wiki/Pareto_principle ). Do you really need to have "billions billions billions of rows"? Or would it be truer to say that about 80% of the Users will each only have (special) privileges for maybe hundreds or thousands of the Articles - in which case, why waste the other "billions billions billions" (rounded :-P).
The fact of the matter is that if you want article-level permissions per user then you need a way to relate User
s to the Article
s they can access. This neccesitates a minimum you need N*A (where A is the number of uniquely permissioned articles).
The 3NF approach to this would be, as you suggested, to have a UsersArticles
set... which would be a very large table (as you noted).
Consider that this table would be accessed a whole lot... This seems to me like one of the situations in which a slightly denormalized approach (or even noSQL) is more appropriate.
Consider the model that Twitter uses for their user follower tables:
Jeff Atwood on the subject
And High Scalability Blog
A sample from those pieces is a lesson learned at Twitter that querying followers from a normalized table puts tremendous stress on a Users
table. Their solution was to denormalize followers so that a user's follower's are stored on their individual user settings.
Denormalize a lot. Single handedly saved them. For example, they store all a user IDs friend IDs together, which prevented a lot of costly joins. - Avoid complex joins. - Avoid scanning large sets of data.
I imagine a similar approach could be used to serve article permissions and avoid a tremendously stressed UsersArticles
single table.
Shard the ArticleUserAuthorization table by user_id. The principle is to reduce the effective dataset size on the access path. Some data will be accessed more frequently than others, also it be be accessed in a particular way. On that path the size of the resultset should be small. Here we do that by having a shard. Also, optimize that path more by maybe having an index if it is a read workload, cache it etc
This particular shard is useful if you want all the articles authorized by a user.
If you want to query by article as well, then duplicate the table and shard by article_id as well. When we have this second sharding scheme, we have denormalized the data. The data is now duplicated and the application would need to do extra work to maintain data-consistency. Writes also will be slower, use a queue for writes
Problem with sharding is that queries across shards is ineffectve, you will need a separate reporting database. Pick a sharding scheme and think about recomputing shards.
For truly massive databases, you would want to split it across physical machines. eg. one or more machines per user's articles.
some nosql suggestions are:
all this depends on the size of your database and the types of queries
EDIT: modified answer. the question previously had 'had_one' relationships Also added nosql suggestions 1 & 2
Reading through all the comments and the question I still doubt the validity of storing all the combinations. Think about the question in another way - who will populate that table? The author of the article or moderator, or someone else? And based on what rule? You wound imagine how difficult that is. It's impossible to populate all the combinations.
Facebook has a similar feature. When you write a post, you can choose who do you want to share it with. You can select 'Friends', 'Friends of Friends', 'Everyone' or custom list. The custom list allows you to define who will be included and excluded. So same as that, you only need to store the special cases, like 'include' and 'exclude', and all the remaining combinations fall into the default case. By dong this, N*M could be reduced significantly.
First of all, it is good to think about default values and behaviors and not store them in the database. For example, if by default, a user cannot read an article unless specified, then, it does not have to be stored as false
in the database.
My second thought is that you could have a users_authorizations
column in your articles
table and a articles_authorizations
in your users
table. Those 2 columns would store user ids and article ids in the form 3,7,65,78,29,78
. For the articles
table for example, this would mean users with ids 3,7,65,78,29,78
can access the articles. Then you would have to modify your queries to retrieve users that way:
@article = Article.find(34)
@users = User.find(@article.user_authorizations.split(','))
Each time an article and a user is saved or destroyed, you would have to create callbacks to update the authorization columns.
class User < ActiveRecord
after_save :update_articles_authorizations
def update_articles_authorizations
#...
end
end
Do the same for Article
model.
Last thing: if you have different types of authorizations, don't hesitate creating more columns like user_edit_authorization
.
With these combined techniques, the quantity of data and hits to the DB are minimal.
You don't have to re-invent the wheel. ACL(Access Control List) frameworks deals with same kind of problem for ages now, and most efficiently if you ask me. You have resources (Article) or even better resource groups (Article Category/Tag/Etc).On the other hand you have users (User) and User Groups. Then you would have a relatively small table which maps Resource Groups to User Groups. And you would have another relatively small table which holds exceptions to this general mapping. Alternatively you can have rule sets to satify for accessing an article.You can even have dynamic groups like : authors_friends depending on your user-user relation.
Just take a look at any decent ACL framework and you would have an idea how to handle this kind of problem.