Tags architecture

Question

I am building a multi-site platform, similar to StackExchange in the way that it has several communities using the same platform and sharing data.

Users can "tag" their content the same way you can tag a stack exchange question. What is the best architecture to create a tag concept?

A few things I have been thinking about: first, the concept of aliases (synonyms). Second, on one hand I want tags to be shared across sites (so one can see content from another site on the same topic), but on the other hand the context could differ between communities. For example, a "graph" in computing is a data structure, while in math it's something else (just a random example off the top of my head; not sure if it matters).

Also if I have a community in English and one in French...

What do you think?

Answer 1:

I would suggest a model like this:

Keep a list of available tags; these are applied to whatever items you're tagging through a standard many-to-many intersection table.

To manage synonyms of tags, use an involuted (self-referencing) relationship on the available tags table. This assumes that, among the tags which are synonyms of each other, one is considered the "main" tag.

The available tags have a language flag to indicate English or French. If you're doing this for the Canadian government or something and need to ensure that everything appears in both languages, you could add an involuted one-to-one (not shown) on AVAILABLE_TAG to link the equivalent English and French tags.

To share tags across sites, use another many-to-many intersection with a SITE table to show which tags belong to which site(s). I would avoid sharing tags across sites if those tags mean different things on each site.
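
A rough DDL sketch of the model described above might look like the following. Only AVAILABLE_TAG and SITE are named in this answer; the other table names (ITEM, ITEM_TAG, TAG_SITE), every column name and the data types are assumptions made for illustration. It is written in generic SQL and may need adjusting for a particular DBMS.

```sql
-- Sketch only: names other than AVAILABLE_TAG and SITE are hypothetical.
CREATE TABLE SITE (
    SITE_ID   INT PRIMARY KEY,
    SITE_NAME VARCHAR(100) NOT NULL
);

CREATE TABLE AVAILABLE_TAG (
    TAG_ID      INT PRIMARY KEY,
    TAG_NAME    VARCHAR(50) NOT NULL,
    LANG_CODE   CHAR(2) NOT NULL,  -- language flag, e.g. 'en' or 'fr'
    -- Self-referencing (involuted) link: NULL for a "main" tag,
    -- otherwise points at the main tag this one is a synonym of.
    MAIN_TAG_ID INT REFERENCES AVAILABLE_TAG (TAG_ID),
    UNIQUE (TAG_NAME, LANG_CODE)   -- assumed uniqueness rule
);

CREATE TABLE ITEM (
    ITEM_ID INT PRIMARY KEY,
    TITLE   VARCHAR(200) NOT NULL
);

-- Many-to-many intersection: which tags are applied to which items.
CREATE TABLE ITEM_TAG (
    ITEM_ID INT NOT NULL REFERENCES ITEM (ITEM_ID),
    TAG_ID  INT NOT NULL REFERENCES AVAILABLE_TAG (TAG_ID),
    PRIMARY KEY (ITEM_ID, TAG_ID)
);

-- Many-to-many intersection: which tags belong to which site(s).
CREATE TABLE TAG_SITE (
    TAG_ID  INT NOT NULL REFERENCES AVAILABLE_TAG (TAG_ID),
    SITE_ID INT NOT NULL REFERENCES SITE (SITE_ID),
    PRIMARY KEY (TAG_ID, SITE_ID)
);

-- Resolve a tag name to its main tag (the tag itself if it has no main tag).
SELECT COALESCE(MAIN_TAG_ID, TAG_ID) AS MAIN_TAG_ID
FROM AVAILABLE_TAG
WHERE TAG_NAME = 'graphs' AND LANG_CODE = 'en';
```

With this shape, a synonym is resolved in a single step, which only works if synonyms always point directly at the main tag and never at another synonym.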

Answer 2:

To do it in a fully normalized way, you'd need something like this:

The MEANING_ITEM has the following indexes:

{SITE_ID, MEANING_NO, ITEM_NO} - automatically created for the primary key; enables efficient search for items with given tags.
{ITEM_NO, SITE_ID, MEANING_NO} - enables efficient querying in the opposite direction: "get tags of the given item".
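
As an illustration, the "get tags of the given item" lookup might be written roughly as below. This is only a sketch: the literal values 42 and 1 are made-up examples, and TAG is assumed to carry the SITE_ID, MEANING_NO and TAG_NAME columns used by the queries later in this answer.

```sql
-- All tag names (including synonyms) attached to item 42 on site 1.
-- The WHERE clause matches the leading columns of the second index,
-- {ITEM_NO, SITE_ID, MEANING_NO}, so the MEANING_ITEM rows for the
-- item can be located from that index.
SELECT TAG.TAG_NAME
FROM MEANING_ITEM JOIN TAG ON
    TAG.SITE_ID = MEANING_ITEM.SITE_ID
    AND TAG.MEANING_NO = MEANING_ITEM.MEANING_NO
WHERE
    MEANING_ITEM.ITEM_NO = 42
    AND MEANING_ITEM.SITE_ID = 1;
```

Because the item is linked to meanings rather than to individual tag names, this returns every synonym of each meaning; picking one display name per meaning would need an extra rule that the model as described does not specify.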

NOTE: If your DBMS supports it, consider clustering this table. Secondary indexes in clustered tables can be expensive (since they need to contain a copy of the whole PK and may lead to a double lookup), but in this case both indexes contain the same fields (so all "extra" fields are already in the secondary index) and there are no fields outside the indexes, so there is no need for the double lookup. By clustering, you simply eliminate the (useless) table heap and are left with just the two B-trees.

This model has the following properties:

Both tags and items are identified in a site-specific manner and you query for site-specific tags by default. If you want to query on tag name irrespective of the site, simply omit SITE_ID = ... from the WHERE clause in the query below. Since TAG_NAME is at the leading edge of TAG's PK, site-less queries can be satisfied efficiently without an additional index.
Items cannot be tagged with tags from a "wrong" site. We are using identifying relationships, which propagate SITE_ID down both edges of the "diamond-shaped" dependency, to be merged at the bottom of the "diamond" (in MEANING_ITEM), which is what gives us this guarantee.
Tag synonyms are represented efficiently (tags that have the same meaning within the same site are considered synonyms). There is no room for the various anomalies that could have happened if we attempted to implement an M:N self-relationship on tags.¹
Since the meaning of tags is site-specific, the synonyms are site-specific as well.
The MEANING table is a natural place to store additional information about tags (such as description), that will be shared by all synonyms.

¹ How would we handle synonym transitivity? If A, B and C are synonyms, do we just store A-B and B-C, or do we also store A-C? How do we enforce it? If we don't enforce it, we'd need some sort of recursive query to pick up all dependencies. And we'd need a row for each connection, wasting space and performance.
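
Assuming the model consists of SITE, MEANING, TAG, ITEM and MEANING_ITEM tables with the keys described above, a DDL sketch could look like this. The key columns follow the text; the SITE table's shape, the data types, the DESCRIPTION column, the index name and the sample rows are assumptions for illustration, and the generic SQL may need adjusting for a specific DBMS.

```sql
-- Sketch only: types, DESCRIPTION, the index name and sample data are assumed.
CREATE TABLE SITE (
    SITE_ID INT PRIMARY KEY
);

CREATE TABLE MEANING (
    SITE_ID     INT NOT NULL REFERENCES SITE (SITE_ID),
    MEANING_NO  INT NOT NULL,
    DESCRIPTION VARCHAR(500),      -- shared by all synonyms of this meaning
    PRIMARY KEY (SITE_ID, MEANING_NO)
);

CREATE TABLE TAG (
    TAG_NAME   VARCHAR(50) NOT NULL,
    SITE_ID    INT NOT NULL,
    MEANING_NO INT NOT NULL,
    PRIMARY KEY (TAG_NAME, SITE_ID),  -- TAG_NAME at the leading edge
    FOREIGN KEY (SITE_ID, MEANING_NO)
        REFERENCES MEANING (SITE_ID, MEANING_NO)
);

CREATE TABLE ITEM (
    SITE_ID INT NOT NULL REFERENCES SITE (SITE_ID),
    ITEM_NO INT NOT NULL,
    PRIMARY KEY (SITE_ID, ITEM_NO)
);

-- Bottom of the "diamond": one SITE_ID column takes part in both foreign
-- keys, so an item can only be linked to a meaning from its own site.
CREATE TABLE MEANING_ITEM (
    SITE_ID    INT NOT NULL,
    MEANING_NO INT NOT NULL,
    ITEM_NO    INT NOT NULL,
    PRIMARY KEY (SITE_ID, MEANING_NO, ITEM_NO),
    FOREIGN KEY (SITE_ID, MEANING_NO)
        REFERENCES MEANING (SITE_ID, MEANING_NO),
    FOREIGN KEY (SITE_ID, ITEM_NO)
        REFERENCES ITEM (SITE_ID, ITEM_NO)
);

-- Second index, for "get tags of the given item".
CREATE UNIQUE INDEX MEANING_ITEM_IE1
    ON MEANING_ITEM (ITEM_NO, SITE_ID, MEANING_NO);

-- Sample rows: 'graph' and 'graphs' are synonyms because they share
-- meaning 1 on site 1; item 100 is tagged with that meaning.
INSERT INTO SITE (SITE_ID) VALUES (1);
INSERT INTO MEANING (SITE_ID, MEANING_NO, DESCRIPTION)
    VALUES (1, 1, 'Graph as a data structure');
INSERT INTO TAG (TAG_NAME, SITE_ID, MEANING_NO) VALUES ('graph', 1, 1);
INSERT INTO TAG (TAG_NAME, SITE_ID, MEANING_NO) VALUES ('graphs', 1, 1);
INSERT INTO ITEM (SITE_ID, ITEM_NO) VALUES (1, 100);
INSERT INTO MEANING_ITEM (SITE_ID, MEANING_NO, ITEM_NO) VALUES (1, 1, 100);
```

Adding a third synonym is a single new TAG row pointing at the same meaning, which is why no pairwise synonym rows are needed.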

To get items with any of the given tags, you'll need to execute a query similar to this...

-- Items tagged with at least one of the listed tags (or a synonym of one).
SELECT *
FROM ITEM
WHERE EXISTS (
    SELECT *
    FROM TAG JOIN MEANING_ITEM ON
        TAG.SITE_ID = MEANING_ITEM.SITE_ID
        AND TAG.MEANING_NO = MEANING_ITEM.MEANING_NO
    WHERE
        TAG.SITE_ID = <site id>
        AND TAG.TAG_NAME IN ( <list of tags> )
        AND ITEM.SITE_ID = MEANING_ITEM.SITE_ID
        AND ITEM.ITEM_NO = MEANING_ITEM.ITEM_NO
)

NOTE: The above query does not need to JOIN to MEANING at all; every field needed for the JOIN is already in TAG.

For items that have all the given tags, you'd need some COUNTing, similar to this:

-- Items tagged with every listed tag: the number of distinct matching
-- tag names must equal the number of tags in the list.
SELECT *
FROM ITEM
WHERE <number of tags> = (
    SELECT COUNT(DISTINCT TAG.TAG_NAME)
    FROM TAG JOIN MEANING_ITEM ON
        TAG.SITE_ID = MEANING_ITEM.SITE_ID
        AND TAG.MEANING_NO = MEANING_ITEM.MEANING_NO
    WHERE
        TAG.SITE_ID = <site id>
        AND TAG.TAG_NAME IN ( <list of tags> )
        AND ITEM.SITE_ID = MEANING_ITEM.SITE_ID
        AND ITEM.ITEM_NO = MEANING_ITEM.ITEM_NO
)
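
The same "all tags" condition can also be expressed with GROUP BY and HAVING. This is not the formulation used in this answer, just a common alternative, sketched here with made-up example values (site 1, tags 'graph' and 'tree') and the same assumed column names as the queries above.

```sql
-- Keys of items on site 1 that carry both 'graph' and 'tree'.
-- The 2 in the HAVING clause is the number of names in the IN list.
SELECT MEANING_ITEM.SITE_ID, MEANING_ITEM.ITEM_NO
FROM TAG JOIN MEANING_ITEM ON
    TAG.SITE_ID = MEANING_ITEM.SITE_ID
    AND TAG.MEANING_NO = MEANING_ITEM.MEANING_NO
WHERE
    TAG.SITE_ID = 1
    AND TAG.TAG_NAME IN ('graph', 'tree')
GROUP BY MEANING_ITEM.SITE_ID, MEANING_ITEM.ITEM_NO
HAVING COUNT(DISTINCT TAG.TAG_NAME) = 2;
```

This returns only the item keys; join the result back to ITEM if the other item columns are needed. Which form performs better depends on the DBMS and the data, so it is worth comparing execution plans for both.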

Now this looks like a lot of JOIN-ing, but this model is excellent for clustered (aka. index-organized) tables and for covering queries with indexes.

You'd probably need to approach the real StackExchange's amount of data before considering denormalizing this design for performance reasons (by, say, removing the junction table and limiting the number of tags per item).

In any case, measure on realistic amounts of data before committing to any particular design.

Source: https://stackoverflow.com/questions/11450654/tags-architecture

Tags

database-design

architecture