Redis sorted sets and best way to store uids

别跟我提以往 2020-12-09 23:10

I have data consisting of user_ids and the tags of these user ids. The user_ids occur multiple times and have a pre-specified number of tags (500); however, that might change in the

1 Answer
  • 2020-12-09 23:51

    My first point is that 4 GB is tight for storing 20M sorted sets. A quick test shows that 20M users, each with 20 tags, would take about 8 GB on a 64-bit box (and this already accounts for the sorted set ziplist memory optimizations introduced in Redis 2.4; don't even try this with earlier versions).
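
    As a rough back-of-the-envelope check of that figure, the measurement can be turned into a per-user cost. This is only a sketch: it reads "8 GB" as 8 * 2**30 bytes, which is an assumption, and the result lumps together key names, sorted set overhead and tag entries.

```python
# Back-of-the-envelope estimate from the measurement quoted above.
# Assumption: "8 GB" is read as 8 GiB (8 * 2**30 bytes).
TOTAL_BYTES = 8 * 2**30
USERS = 20_000_000
TAGS_PER_USER = 20

# Average cost of one user (key + sorted set), and of one tag entry in it
bytes_per_user = TOTAL_BYTES / USERS
bytes_per_tag = bytes_per_user / TAGS_PER_USER

print("about %.0f bytes per user, %.1f bytes per tag entry"
      % (bytes_per_user, bytes_per_tag))
```

    At that rate, a 4 GB budget would hold roughly 10M users with 20 tags each, before counting any extra index.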

    Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.

    As you pointed out, KEYS should not be used to iterate over keys: it walks the entire keyspace in a single blocking call and is really meant as a debugging command. To support key iteration, you need to add a data structure that provides this access path. The only structures in Redis that can support iteration are the list and the sorted set (through the range commands). However, they tend to turn O(n) iteration algorithms into O(n^2) (for a list) or O(n log n) (for a zset). A list is also a poor choice for storing keys, since it is difficult to maintain as keys are added and removed.

    A more efficient solution is to add an index composed of regular sets. Use a hash function to map each user to a bucket, and add the user id to the set corresponding to that bucket. If the user ids are numeric, a simple modulo function is enough. If they are not, a simple string hashing function will do the trick.

    So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.

    So on top of the zsets, we now have the following keys:

    index:0 => set[ 1000, 2000 ]
    index:1 => set[ 1001 ]
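
    The mapping from a user id to its bucket could be sketched as follows. The helper name bucket_key and the choice of CRC32 for non-numeric ids are assumptions made for illustration; any stable string hash would do.

```python
import zlib

NBUCKETS = 1000  # the modulo chosen in the example above


def bucket_key(user_id, nbuckets=NBUCKETS):
    """Return the name of the index set that holds user_id (hypothetical helper)."""
    if isinstance(user_id, int):
        # Numeric ids: a plain modulo is enough
        slot = user_id % nbuckets
    else:
        # Non-numeric ids: use a stable string hash (CRC32 here, as one option).
        # Python's built-in hash() is avoided because, for strings, it is
        # randomized per process by default.
        slot = zlib.crc32(str(user_id).encode("utf-8")) % nbuckets
    return "index:%d" % slot


print(bucket_key(1000), bucket_key(2000), bucket_key(1001))
# prints: index:0 index:0 index:1
```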
    

    The sets store only the numeric ids, without the user: key prefix. This lets Redis reduce memory consumption by storing each set in a compact serialized form, provided the sets are kept small enough (the integer sets optimization proposed by Sripathi Krishnan).

    The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, the SMEMBERS command retrieves the corresponding set, and the client can then iterate over the individual items.

    Here is an example in Python:

    #!/usr/bin/env python3
    # Python 3 port. Assumes redis-py 3.0 or later, where the
    # signature is zincrby(name, amount, value).
    # ----------------------------------------------------

    import random

    import redis

    # decode_responses=True makes the client return str instead of bytes
    POOL = redis.ConnectionPool(host='localhost', port=6379, db=0,
                                decode_responses=True)

    NUSERS = 10000
    NTAGS = 500
    NBUCKETS = 1000

    # ----------------------------------------------------
    # Fill redis with some random data

    def fill(r):
        p = r.pipeline()
        # Create only 10000 users for this example
        for uid in range(NUSERS):
            user = "user:%d" % uid
            # Add the user in the index: a simple modulo is used to hash the user id
            # and put it in the correct bucket
            p.sadd("index:%d" % (uid % NBUCKETS), uid)
            # Add 20 random tags to the user, drawn from tag:0 .. tag:(NTAGS-1)
            for _ in range(20):
                tag = "tag:%d" % random.randrange(NTAGS)
                p.zincrby(user, 1, tag)
            # Flush the pipeline every 1000 users
            if uid % 1000 == 0:
                p.execute()
                print(uid)
        # Flush one last time
        p.execute()

    # ----------------------------------------------------
    # Iterate on all the users and display their 5 highest ranked tags

    def iterate(r):
        # Iterate on the buckets of the key index
        # The range depends on the function used to hash the user id
        for x in range(NBUCKETS):
            # Iterate on the users in this bucket
            for uid in r.smembers("index:%d" % x):
                user = "user:%d" % int(uid)
                print(user, r.zrevrangebyscore(user, "+inf", "-inf",
                                               start=0, num=5, withscores=True))

    # ----------------------------------------------------
    # Main function

    def main():
        r = redis.Redis(connection_pool=POOL)
        # WARNING: flushall() wipes every database on the server
        r.flushall()
        m = r.info()["used_memory"]
        fill(r)
        info = r.info()
        print("Keys: ", info["db0"]["keys"])
        print("Memory: ", info["used_memory"] - m)
        iterate(r)

    # ----------------------------------------------------

    if __name__ == "__main__":
        main()
    

    By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.

    IMO this strategy is simple and efficient, because it offers O(1) complexity to add or remove a user in the index, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.
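
    Keeping the index in sync when users come and go could look like the sketch below. The helper names add_user and remove_user are hypothetical, r is assumed to be a redis-py client (or anything exposing sadd, srem and delete), and numeric user ids with the same modulo as above are assumed.

```python
NBUCKETS = 1000  # must match the modulo used to build the index


def add_user(r, uid):
    """Register uid in its index bucket with a single SADD."""
    r.sadd("index:%d" % (uid % NBUCKETS), uid)


def remove_user(r, uid):
    """Remove uid from its index bucket, then delete the user's sorted set."""
    r.srem("index:%d" % (uid % NBUCKETS), uid)
    r.delete("user:%d" % uid)
```

    The two commands in remove_user are not atomic as written; if that matters, they could be queued on a pipeline so that they are sent together.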
