Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

陌路散爱 提交于 2019-12-02 13:48:59
Winston Fassett

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

Regarding the OS projects recommended by @Grant:

  • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

I also found some other questions on SO that I'd missed before:

What I'm currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
  2. For search I use Lucene + NHibernate.Search
    • Tags are concat'd into a TagString that is indexed by Lucene
      • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
      • I can search for text and filter by tags at the same time
      • The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
    • Lucene returns a potentially enormous result set, which I paginate to 20 results
    • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
    • So it's entirely possible that a search results in 0 hits to the DB
  3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
  4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.

This means that whenever an Entity's tags are modified, I have to:

  • Insert any new Tags that do not already exist
  • Insert/Delete from the EntityTag table
  • Update Entity.TagString
  • Update the Lucene index for the Entity

Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.

Time will tell...

I don't know if they qualify as optimal, but both DotNetKicks and Kigg are open source digg clone implementations. You can look at how they're doing tags and search.

My best guesses without a lot of deliberation :)

  1. I never like the idea of serializing multiple values into a single field, so delimited strings stored in one field don't appeal to me... might work for adjacency paths with trees, but those are always ordered and tags need not be. This seems like it would tax the LIKE operator work you might do to find them.

So my initial take is probably Entity -> EntityTag <- Tag.

  1. This approach makes finding items via Tag pretty easy, join back through EntityTag, call it a day.

  2. You need a secondary operation here to select the distinct tags for the result set. So a.) pull the result set, b.) normalize the tag space. I think you do this no matter what the answer is to #1 -- even stuffing tags into one field will still yield duplicate tags (and you have to deserialize them to perform this op--so more work, another argument for a fully relational approach).

  3. Still easy. Here's one area where the serialized approach works better. No need to join for child tags, it's right there in the Entity. That said, pulling out 0..n tags via the two table join doesn't seem too challenging to me. If you're talking perf considerations, build it normalized first then optimize via cache or denorm.

The other option is "do both". This feels like a premature optimization, but you could do the full normalized approach to support any tag-centric operations and serialize upon persist to have a denormalized version right there in the Entity. A bit more work, some potential to fall out of synch if not fully covered, but best of both worlds if there's real limitations to the fully normalized way in your use cases.

Lucene is interesting as well, you can declare specific metadata in the indices IIRC, so you could potentially leverage tag search this way as well. My suspicion is, if you go too far down this road, then you end up having some disconnects between what you store in the database and the index at some point. I can speak favorably about Lucene, it's very capable and easy to use--I believe .Text used it for it's search capabilities and it supported all of weblogs.asp.net prior to it switching over to Community Server. I'd stick to it for full-text search if MSSQL isn't in the picture/sufficient, solve the tag issues in the database imo.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!