Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

前端 未结 2 1834
南笙
南笙 2021-01-29 17:47

I\'d love to know how Stack Overflow\'s tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do a

2条回答
  •  长情又很酷
    2021-01-29 18:10

    Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

    So here I go again...

    Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

    Regarding the OS projects recommended by @Grant:

    • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
    • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
    • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

    I also found some other questions on SO that I'd missed before:

    • How Do You Recommend Implementing Tags or Tagging?
    • How to structure data for searchability?
    • Database Design for Tagging

    What I'm currently doing for each of the items I mentioned:

    1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
      • Build site-wide tag clouds
      • browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
    2. For search I use Lucene + NHibernate.Search
      • Tags are concat'd into a TagString that is indexed by Lucene
        • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
        • I can search for text and filter by tags at the same time
        • The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
      • Lucene returns a potentially enormous result set, which I paginate to 20 results
      • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
      • So it's entirely possible that a search results in 0 hits to the DB
    3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
    4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.

    This means that whenever an Entity's tags are modified, I have to:

    • Insert any new Tags that do not already exist
    • Insert/Delete from the EntityTag table
    • Update Entity.TagString
    • Update the Lucene index for the Entity

    Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.

    Time will tell...

提交回复
热议问题