I'd love to know how Stack Overflow's tagging and search are architected, because they seem to work pretty well.
What is a good database/search model if I want to do a
Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.
So here I go again...
Regarding Stack Overflow, it turns out that they use SQL Server 2005 full-text search.
Regarding the open source projects recommended by @Grant:
I also found some other questions on SO that I'd missed before:
What I'm currently doing for each of the items I mentioned:
This means that whenever an Entity's tags are modified, I have to:
Given that the ratio of reads to writes is very high in my application, I think I'm OK with this. The only really time-consuming part is Lucene indexing: Lucene can only insert into and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
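A minimal sketch of that background re-index, assuming a delete-then-add update. `ToyIndex` is a hypothetical in-memory stand-in, not the Lucene API, and the queue/worker layout is just one way to keep the work off the request thread:

```python
import queue
import threading


class ToyIndex:
    """Hypothetical stand-in for an index that supports only add and delete."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def delete(self, entity_id):
        with self._lock:
            self._docs.pop(entity_id, None)

    def add(self, entity_id, fields):
        with self._lock:
            self._docs[entity_id] = dict(fields)

    def get(self, entity_id):
        with self._lock:
            return self._docs.get(entity_id)


def reindex_worker(index, jobs):
    """Drain the queue; each job replaces the whole indexed entity."""
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: stop the worker
                return
            entity_id, fields = job
            index.delete(entity_id)       # no in-place update, so remove...
            index.add(entity_id, fields)  # ...and insert the fresh copy
        finally:
            jobs.task_done()


index = ToyIndex()
jobs = queue.Queue()
worker = threading.Thread(target=reindex_worker, args=(index, jobs), daemon=True)
worker.start()

# Two tag edits to the same entity; the later one wins.
jobs.put((42, {"title": "Tagging question", "TagString": "lucene search"}))
jobs.put((42, {"title": "Tagging question", "TagString": "lucene search tags"}))
jobs.join()     # wait until both jobs are processed
jobs.put(None)  # ask the worker to exit
worker.join()
print(index.get(42)["TagString"])
```

A single worker keeps updates to the same entity in order, which matters when two tag edits arrive close together.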
Time will tell...
I don't know if they qualify as optimal, but both DotNetKicks and Kigg are open source Digg clone implementations. You can look at how they're doing tags and search.
My best guesses without a lot of deliberation :)
So my initial take is probably Entity -> EntityTag <- Tag.
This approach makes finding items via Tag pretty easy: join back through EntityTag and call it a day.
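Roughly what that looks like, as a sketch using SQLite from Python. The column names and the helper functions (`make_schema`, `add_entity`, `entities_with_tag`) are my own assumptions for illustration, not anything from the question:

```python
import sqlite3


def make_schema(conn):
    # Entity -> EntityTag <- Tag: a plain many-to-many mapping.
    conn.executescript("""
        CREATE TABLE Entity (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE EntityTag (
            entity_id INTEGER NOT NULL REFERENCES Entity(id),
            tag_id    INTEGER NOT NULL REFERENCES Tag(id),
            PRIMARY KEY (entity_id, tag_id)
        );
    """)


def add_entity(conn, title, tags):
    cur = conn.execute("INSERT INTO Entity (title) VALUES (?)", (title,))
    entity_id = cur.lastrowid
    for name in tags:
        # Reuse the Tag row if the tag already exists.
        conn.execute("INSERT OR IGNORE INTO Tag (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT OR IGNORE INTO EntityTag (entity_id, tag_id) "
            "SELECT ?, id FROM Tag WHERE name = ?",
            (entity_id, name),
        )
    return entity_id


def entities_with_tag(conn, tag_name):
    # Find items via Tag by joining back through EntityTag.
    rows = conn.execute(
        """
        SELECT e.title
        FROM Entity e
        JOIN EntityTag et ON et.entity_id = e.id
        JOIN Tag t ON t.id = et.tag_id
        WHERE t.name = ?
        ORDER BY e.id
        """,
        (tag_name,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
make_schema(conn)
add_entity(conn, "Indexing with Lucene", ["lucene", "search"])
add_entity(conn, "Full-text search in SQL Server", ["sql-server", "search"])
print(entities_with_tag(conn, "search"))
```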
You need a secondary operation here to select the distinct tags for the result set. So (a) pull the result set, then (b) normalize the tag space. I think you do this no matter what the answer is to #1 -- even stuffing tags into one field will still yield duplicate tags (and you have to deserialize them to perform this op, so more work, which is another argument for a fully relational approach).
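A sketch of that secondary operation against the normalized tables, again with SQLite. The sample rows and the `distinct_tags` helper are made up for illustration, and I'm assuming the result set has already been reduced to a list of entity ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entity (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE EntityTag (
        entity_id INTEGER NOT NULL REFERENCES Entity(id),
        tag_id    INTEGER NOT NULL REFERENCES Tag(id),
        PRIMARY KEY (entity_id, tag_id)
    );
    INSERT INTO Entity (id, title) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Tag (id, name) VALUES (1, 'lucene'), (2, 'search'), (3, 'sql-server');
    INSERT INTO EntityTag (entity_id, tag_id)
        VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 3);
""")


def distinct_tags(conn, entity_ids):
    """Step (b): collapse the tags of a result set into a distinct list."""
    ids = list(entity_ids)
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))  # one "?" per entity id
    rows = conn.execute(
        f"""
        SELECT DISTINCT t.name
        FROM EntityTag et
        JOIN Tag t ON t.id = et.tag_id
        WHERE et.entity_id IN ({placeholders})
        ORDER BY t.name
        """,
        ids,
    )
    return [row[0] for row in rows]


# Step (a) would be the search itself; here entities 1 and 2 are the result set.
print(distinct_tags(conn, [1, 2]))
```

Entities 1 and 2 both carry `search`, and `SELECT DISTINCT` folds that into a single entry without any deserialization step.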
Still easy. Here's one area where the serialized approach works better: no need to join for child tags, they're right there in the Entity. That said, pulling out 0..n tags via the two-table join doesn't seem too challenging to me. If you're talking perf considerations, build it normalized first, then optimize via cache or denorm.
The other option is "do both". This feels like a premature optimization, but you could do the full normalized approach to support any tag-centric operations and serialize upon persist to have a denormalized version right there in the Entity. It's a bit more work, with some potential to fall out of sync if not fully covered, but it's the best of both worlds if there are real limitations to the fully normalized way in your use cases.
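A sketch of "do both", assuming a `tag_string` column on Entity (my name for it) that gets rewritten in the same transaction as the EntityTag rows. `set_tags` is a hypothetical helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entity (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        tag_string TEXT NOT NULL DEFAULT ''  -- denormalized copy of the tags
    );
    CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE EntityTag (
        entity_id INTEGER NOT NULL REFERENCES Entity(id),
        tag_id    INTEGER NOT NULL REFERENCES Tag(id),
        PRIMARY KEY (entity_id, tag_id)
    );
""")


def set_tags(conn, entity_id, tags):
    """Replace an entity's tags in both the normalized and serialized form."""
    names = sorted(set(tags))
    with conn:  # one transaction, so the two copies cannot drift apart
        conn.execute("DELETE FROM EntityTag WHERE entity_id = ?", (entity_id,))
        for name in names:
            conn.execute("INSERT OR IGNORE INTO Tag (name) VALUES (?)", (name,))
            conn.execute(
                "INSERT INTO EntityTag (entity_id, tag_id) "
                "SELECT ?, id FROM Tag WHERE name = ?",
                (entity_id, name),
            )
        conn.execute(
            "UPDATE Entity SET tag_string = ? WHERE id = ?",
            (" ".join(names), entity_id),
        )


conn.execute("INSERT INTO Entity (id, title) VALUES (1, 'Tagging question')")
set_tags(conn, 1, ["search", "lucene", "search"])
print(conn.execute("SELECT tag_string FROM Entity WHERE id = 1").fetchone()[0])
```

Funnelling every tag write through one function is what keeps the "fall out of sync" risk manageable; any code path that touches EntityTag directly would reintroduce it.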
Lucene is interesting as well. You can declare specific metadata in the indices IIRC, so you could potentially leverage tag search this way too. My suspicion is that if you go too far down this road, you end up with some disconnects between what you store in the database and what's in the index at some point. I can speak favorably about Lucene: it's very capable and easy to use. I believe .Text used it for its search capabilities, and it supported all of weblogs.asp.net prior to it switching over to Community Server. I'd stick to it for full-text search if MSSQL isn't in the picture or isn't sufficient, and solve the tag issues in the database, imo.