Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.
So here I go again...
Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.
Regarding the OS projects recommended by @Grant:
- *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
- Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
- Both projects have a 3-table approach to tagging as everyone generally seems to recommend
I also found some other questions on SO that I'd missed before:
- How Do You Recommend Implementing Tags or Tagging?
- How to structure data for searchability?
- Database Design for Tagging
What I'm currently doing for each of the items I mentioned:
- In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
- Build site-wide tag clouds
- browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
- For search I use Lucene + NHibernate.Search
- Tags are concat'd into a TagString that is indexed by Lucene
- So I have the full power of the Lucene query engine (AND / OR / NOT queries)
- I can search for text and filter by tags at the same time
- The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
- Lucene returns a potentially enormous result set, which I paginate to 20 results
- Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
- So it's entirely possible that a search results in 0 hits to the DB
- Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
- Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.
This means that whenever an Entity's tags are modified, I have to:
- Insert any new Tags that do not already exist
- Insert/Delete from the EntityTag table
- Update Entity.TagString
- Update the Lucene index for the Entity
Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
Time will tell...