Having just recently delved into the world of NoSQL with MongoDB, I am still struggling to understand the best approach to architecture without normalizing the data to third normal form and then
You need to consider the type of queries you will need to perform and how frequently each type will be needed. When I was working on something similar, I came up with six possible actions:
You'll want to estimate how important each of these is to your application.
If most of your work involves stored data for a given article together with its immediate parent and children, the first idea is the most useful. Indeed, in MongoDB it is quite common to place all the information you need in the same document rather than referencing it externally, so that you only need to retrieve one thing and work with that data. The last four actions in the list are trickier, though.
In particular, you will need to traverse through the tree to retrieve ancestors and descendants in this case, moving through intermediary documents and following a path, even though you may only care about the last document in the path. This can be slow for long hierarchies. Changing relationships can require moving a lot of information around in multiple documents because of all the data present in each one. But even changing a single field like "title" can be annoying, because you have to consider the fact that this field is present in multiple different documents, either as a main field or under the parent or children fields.
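To make the "title" problem concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for the collection. The field names (`title`, `parent`, `children`) and the `rename` helper are assumptions for illustration, not something taken from the question, and a real application would issue the equivalent updates against MongoDB.

```python
# Hypothetical in-memory stand-in for a collection: _id -> document.
# Each article embeds copies of its parent's and children's titles.
articles = {
    1: {"_id": 1, "title": "Root", "parent": None,
        "children": [{"_id": 2, "title": "Intro"}]},
    2: {"_id": 2, "title": "Intro", "parent": {"_id": 1, "title": "Root"},
        "children": [{"_id": 3, "title": "Details"}]},
    3: {"_id": 3, "title": "Details", "parent": {"_id": 2, "title": "Intro"},
        "children": []},
}


def rename(collection, article_id, new_title):
    """Rename an article and every embedded copy of it.

    Returns the sorted _ids of all documents that had to be modified.
    """
    touched = []
    for doc in collection.values():
        changed = False
        # The article's own document.
        if doc["_id"] == article_id:
            doc["title"] = new_title
            changed = True
        # The copy stored under a child's "parent" field.
        parent = doc["parent"]
        if parent is not None and parent["_id"] == article_id:
            parent["title"] = new_title
            changed = True
        # The copy stored in the parent's "children" list.
        for child in doc["children"]:
            if child["_id"] == article_id:
                child["title"] = new_title
                changed = True
        if changed:
            touched.append(doc["_id"])
    return sorted(touched)


touched_ids = rename(articles, 2, "Overview")
print(touched_ids)  # [1, 2, 3]
```

Renaming one article in this three-node chain touches all three documents: the article itself, its parent, and its child.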
Basically, your first idea works best in more static applications where you won't be changing the data a lot after initially creating it, but where you need to read it regularly.
The MongoDB documentation has five recommended approaches for handling tree-like (hierarchical) structures: Parent References, Child References, Array of Ancestors, Materialized Paths, and Nested Sets. Each has different advantages and disadvantages, though they all make it easy to update the main data in an article, because that data lives in only one document.
The five approaches are discussed in more detail in the MongoDB documentation.
Your second idea combines the "Parent References" and "Child References" approaches discussed above. This approach makes it easy to find both the children and the parent and makes it easy to update relationships and the main data of an article (though you need to update both the parent and the children fields), but you still need to traverse through it to find ancestors and descendants.
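As a rough sketch of what this looks like, assume each document stores only its parent's `_id` and a list of its children's `_id`s. The dict below stands in for the collection, and `ancestors` and `move` are hypothetical helpers written for this example.

```python
# Hypothetical documents: each stores its parent's _id and its children's _ids.
articles = {
    "root": {"_id": "root", "title": "Root", "parent": None, "children": ["a", "b"]},
    "a": {"_id": "a", "title": "A", "parent": "root", "children": ["c"]},
    "b": {"_id": "b", "title": "B", "parent": "root", "children": []},
    "c": {"_id": "c", "title": "C", "parent": "a", "children": []},
}


def ancestors(collection, article_id):
    """Walk the parent references upward; this costs one lookup per level."""
    result = []
    parent_id = collection[article_id]["parent"]
    while parent_id is not None:
        result.append(parent_id)
        parent_id = collection[parent_id]["parent"]
    return result


def move(collection, article_id, new_parent_id):
    """Re-parent an article; at most three documents change.

    No cycle check is done here, so moving a node under its own
    descendant would corrupt the tree.
    """
    old_parent_id = collection[article_id]["parent"]
    if old_parent_id is not None:
        collection[old_parent_id]["children"].remove(article_id)
    collection[new_parent_id]["children"].append(article_id)
    collection[article_id]["parent"] = new_parent_id


before = ancestors(articles, "c")
move(articles, "c", "b")
after = ancestors(articles, "c")
print(before, after)  # ['a', 'root'] ['b', 'root']
```

The move only changes the old parent, the new parent, and the article itself, but finding ancestors still takes a loop with one lookup per level.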
If you are interested in finding ancestors and descendants (and care about this more than being able to easily update relationships), you can consider adding an ancestors array to your second idea, which also makes it easy to query for ancestors and descendants. Updating relationships becomes a real pain if you do this, though.
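Here is a minimal sketch of that trade-off, again with plain Python structures standing in for the collection. The `ancestors` field layout (root first) and the `descendants` and `move` helpers are assumptions made for this example.

```python
# Hypothetical documents: each stores the full list of its ancestors, root first.
articles = [
    {"_id": "root", "parent": None, "ancestors": []},
    {"_id": "a", "parent": "root", "ancestors": ["root"]},
    {"_id": "b", "parent": "root", "ancestors": ["root"]},
    {"_id": "c", "parent": "a", "ancestors": ["root", "a"]},
    {"_id": "d", "parent": "c", "ancestors": ["root", "a", "c"]},
]


def descendants(collection, article_id):
    """One pass with a membership test, no traversal needed.

    In MongoDB this would be a single query matching documents whose
    ancestors array contains article_id.
    """
    return [doc["_id"] for doc in collection if article_id in doc["ancestors"]]


def move(collection, article_id, new_parent_id):
    """Re-parent an article and rewrite the ancestors of its whole subtree.

    Returns the _ids of every rewritten document. No cycle check is done.
    """
    by_id = {doc["_id"]: doc for doc in collection}
    node = by_id[article_id]
    old_prefix_len = len(node["ancestors"])
    new_prefix = by_id[new_parent_id]["ancestors"] + [new_parent_id]
    node["parent"] = new_parent_id
    node["ancestors"] = new_prefix
    rewritten = [article_id]
    for doc in collection:
        if article_id in doc["ancestors"]:
            # Swap the old path to the moved node for the new one,
            # keeping everything from the moved node downward.
            doc["ancestors"] = new_prefix + doc["ancestors"][old_prefix_len:]
            rewritten.append(doc["_id"])
    return rewritten


under_a_before = descendants(articles, "a")
rewritten_ids = move(articles, "c", "b")
under_b_after = descendants(articles, "b")
print(under_a_before, rewritten_ids, under_b_after)
```

Reading ancestors is just a field access and finding descendants is a single filter, but moving `c` forces a rewrite of `c` and of every document beneath it.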
Conclusion:
Ultimately it all depends on what actions are needed the most. Since you're working with articles, whose underlying data (like the title) can change frequently, you may want to avoid the first idea since you would need to update not only the main document for that article but all child documents as well as the parent.
Your second idea makes it easy to retrieve the immediate parent and children. Updating relationships is also not too difficult (it's certainly better than some of the other options available).
If you really want to make it easy to find ancestors and descendants, and are willing to give up easy relationship updates to get it, choose to include an array of ancestor references.
In general, try to minimize the number of traversals required, as they require running some kind of iteration or recursion to get to the data you want. If you value the ability to update relationships, you should also pick an option that changes fewer nodes in the tree (Parent References, Child References, and your second idea can do this).
Very nice summary from @CynicalProgrammer. I would add one more: use the fact that JSON is a tree to your advantage! There's no need to store only the 'left' and 'right' node IDs in each node; why not store a subtree, three to five levels deep? It would look like this:
    {
      left: {
        left: {
          left: {...},
          right: {...}
        },
        right: {...}
      },
      right: {... you get the idea ...}
    }
This way, you'd need far fewer queries to traverse the tree, and the number of documents in the collection would be a fraction of what it was. A slight drawback is that you need somewhat more complex code to write to the tree, and MongoDB documents will be bigger, meaning individual writes are slower.
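To get a feel for the savings, here is a small sketch that builds a complete binary tree split into documents of a fixed number of levels and counts how many documents a lookup has to fetch. Everything here is an assumption for illustration: the `ref` key used at document boundaries, the path-based ids, and the `build` and `find` helpers.

```python
def build(total_levels, chunk_levels):
    """Build a complete binary tree, stored as documents of chunk_levels levels.

    Returns a dict standing in for the collection: document id -> nested dict.
    """
    collection = {}

    def make_doc(path, levels_left):
        doc_id = path or "root"
        collection[doc_id] = make_node(path, levels_left, chunk_levels)
        return doc_id

    def make_node(path, levels_left, room):
        node = {"value": path or "root"}
        if levels_left > 1:
            for side, step in (("left", "L"), ("right", "R")):
                child_path = path + step
                if room > 1:
                    # Still room in this document: embed the child.
                    node[side] = make_node(child_path, levels_left - 1, room - 1)
                else:
                    # Document is full: store a reference to a new document.
                    node[side] = {"ref": make_doc(child_path, levels_left - 1)}
        return node

    make_doc("", total_levels)
    return collection


def find(collection, path):
    """Follow a path such as "LRL" from the root.

    Returns (value, number of documents fetched).
    """
    fetched = 1
    node = collection["root"]
    for step in path:
        node = node["left" if step == "L" else "right"]
        if "ref" in node:
            node = collection[node["ref"]]
            fetched += 1
    return node["value"], fetched


flat = build(6, 1)     # one node per document
chunked = build(6, 3)  # three levels per document
print(len(flat), find(flat, "LRLRL"))
print(len(chunked), find(chunked, "LRLRL"))
```

For a six-level tree, one node per document gives 63 documents and six fetches to reach a leaf, while three levels per document gives 9 documents and two fetches.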
I think this is probably the best way to store trees in MongoDB. But again, remember, MongoDB is meant for document storage, not "big-ass schemaless tree" storage. You might want to look into something like Neo4j for that.