Is Python's uuid1 sequential, like timestamps?

孤城傲影 2020-12-19 00:34

The Python docs state that uuid1 uses the current time to form the UUID value, but I could not find a reference that guarantees UUID1 values are sequential.

>>> impor         


        
5 answers
  • 2020-12-19 00:55

    Calling uuid.uuid1() without arguments gives non-sequential results (see the answer by @basil-bourque), but it can easily be made sequential by setting the clock_seq or node argument. In that case uuid1 uses the pure-Python implementation, which guarantees a unique and increasing timestamp part of the UUID within the current process:

    import time
    
    from uuid import uuid1, getnode
    from random import getrandbits
    
    _my_clock_seq = getrandbits(14)
    _my_node = getnode()
    
    
    def sequential_uuid(node=None):
        return uuid1(node=node, clock_seq=_my_clock_seq)
    
    
    def alt_sequential_uuid(clock_seq=None):
        return uuid1(node=_my_node, clock_seq=clock_seq)
    
    
    
    if __name__ == '__main__':
        from itertools import count
        old_n = uuid1()  # "Native"
        old_s = sequential_uuid()  # Sequential
    
        native_conflict_index = None
    
        t_0 = time.time()
    
        for x in count():
            new_n = uuid1()
            new_s = sequential_uuid()
    
            if old_n > new_n and not native_conflict_index:
                native_conflict_index = x
    
            if old_s >= new_s:
                print("OOops: non-sequential results for `sequential_uuid()`")
                break
    
            if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
                print('No issues for `sequential_uuid()`')
                break
    
            old_n = new_n
            old_s = new_s
    
        print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
        print(f"Tries: {x}")
    
    

    Issues with multiple processes

    BUT if you are running several parallel processes on the same machine, then:

    • node, which defaults to uuid.getnode(), will be the same for all the processes;
    • clock_seq has a small chance (1/16384) of being the same for some of the processes.

    That might lead to conflicts! This is a general concern when using uuid.uuid1 in parallel processes on the same machine, unless you have access to SafeUUID (Python 3.7 and later).
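
    As a small illustration of the SafeUUID remark: since Python 3.7 every UUID object carries an is_safe attribute holding a uuid.SafeUUID member. A minimal sketch, where the helper name describe_safety is made up for this example:

```python
import uuid


def describe_safety(u: uuid.UUID) -> str:
    """Return a human-readable note about how the UUID was generated."""
    if u.is_safe is uuid.SafeUUID.safe:
        return "generated in a multiprocessing-safe way"
    if u.is_safe is uuid.SafeUUID.unsafe:
        return "not generated in a multiprocessing-safe way"
    # SafeUUID.unknown: the platform gives no information either way.
    return "platform gives no information about safety"


if __name__ == "__main__":
    # The result depends on the platform and on how uuid1() was built.
    print(describe_safety(uuid.uuid1()))
```

    Which of the three values you get from uuid1() depends on the operating system, so this is only a way to inspect the result, not to enforce safety.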

    If you also make sure to set node to a unique value for each parallel process that runs this code, then conflicts should not happen.
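
    One possible way to get a different node per process is sketched below. It is an assumption of this example, not something uuid does for you: each process draws a random 48-bit node with the multicast bit set (the convention RFC 4122 describes for node ids that are not real MAC addresses) and redraws it when the process id changes, for example after a fork. The names random_node and process_local_uuid are hypothetical.

```python
import os
import secrets
from uuid import uuid1

# Per-process state; refreshed whenever the process id changes (e.g. after fork).
_state = {"pid": None, "node": None, "clock_seq": None}


def random_node() -> int:
    """Random 48-bit node id with the multicast bit set (not a real MAC)."""
    return secrets.randbits(48) | (1 << 40)


def process_local_uuid():
    """uuid1 with a node and clock_seq that are fixed within one process."""
    pid = os.getpid()
    if _state["pid"] != pid:
        # First call in this process, or we are running in a forked child.
        _state.update(pid=pid, node=random_node(), clock_seq=secrets.randbits(14))
    return uuid1(node=_state["node"], clock_seq=_state["clock_seq"])


if __name__ == "__main__":
    print(process_local_uuid())
    print(process_local_uuid())
```

    Two processes can still draw the same random node, but with 47 random bits that is far less likely than a clock_seq clash.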

    Even if you are using SafeUUID and set a unique node, it is still possible to get non-sequential IDs if they are generated in different processes.

    If some lock-related overhead is acceptable, you can store clock_seq in external atomic storage (for example, in a "locked" file) and increment it on each call: this allows the same node value to be used by all parallel processes and also makes the IDs sequential. When all the parallel processes are subprocesses created with multiprocessing, clock_seq can be "shared" using multiprocessing.Value.
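
    The multiprocessing.Value variant could look roughly like the sketch below. The function name next_uuid and the choice to wrap the counter at 14 bits are assumptions of this example; the counter has to be created in the parent and handed to the workers.

```python
import multiprocessing
from uuid import getnode, uuid1

_NODE = getnode()  # same node for every process on this machine


def next_uuid(shared_seq):
    """Create a uuid1 while holding the lock of the shared counter."""
    with shared_seq.get_lock():
        seq = shared_seq.value
        # clock_seq is a 14-bit field, so the counter wraps at 2**14.
        shared_seq.value = (seq + 1) % (1 << 14)
        # Generating inside the lock keeps timestamp and counter in step.
        return uuid1(node=_NODE, clock_seq=seq)


if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)  # pass this object to the workers
    print(next_uuid(counter))
    print(next_uuid(counter))
```

    Because the counter wraps after 16384 calls, it only separates IDs whose timestamps are close together; the ordering over longer periods still comes from the timestamp.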

  • 2020-12-19 01:04

    I stumbled upon a probable answer, in the context of Cassandra/Python, at http://doanduyhai.wordpress.com/2012/07/05/apache-cassandra-tricks-and-traps/

    Lexicographic TimeUUID ordering

    Cassandra provides, among all the primitive types, support for UUID values of type 1 (time and server based) and type 4 (random).

    The primary use of UUID (Universally Unique IDentifier) is to obtain a really unique identifier in a potentially distributed environment.

    Cassandra does support version 1 UUID. It gives you a unique identifier by combining the computer’s MAC address and the number of 100-nanosecond intervals since the beginning of the Gregorian calendar.

    As you can see, the precision is only 100 nanoseconds, but fortunately it is mixed with a clock sequence to add randomness. Furthermore, the MAC address is also used to compute the UUID, so it’s very unlikely that you face a collision on one cluster of machines, unless you need to process a really, really huge volume of data (don’t forget, not everyone is Twitter or Facebook).
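
    In Python, the timestamp described above is exposed as the time attribute of a UUID object. A minimal sketch of turning it back into a date, assuming the count starts at 1582-10-15 (the start of the Gregorian calendar used by version 1 UUIDs); the helper name uuid1_datetime is made up here, and the sample value is one of the TimeUUIDs quoted further down:

```python
import uuid
from datetime import datetime, timedelta, timezone

# Start of the Gregorian calendar, the epoch of version 1 UUID timestamps.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_datetime(u: uuid.UUID) -> datetime:
    """Convert the 60-bit timestamp (100 ns ticks) of a version 1 UUID to UTC."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # 10 ticks of 100 ns make one microsecond; sub-microsecond digits are dropped.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


if __name__ == "__main__":
    print(uuid1_datetime(uuid.UUID("8e4cab00-c481-11e1-983b-20cf309ff6dc")))
```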

    One of the most relevant use cases for UUID, and especially TimeUUID, is to use it as a column key. Since Cassandra column keys are sorted, we can take advantage of this feature to have a natural ordering for our column families.

    The problem with the default com.eaio.uuid.UUID provided by the Hector client is that it’s not easy to work with. As an ID you may need to bring this value from the server up to the view layer, and that’s the gotcha.

    Basically, com.eaio.uuid.UUID overrides toString() to give a String representation of the UUID. However, this String formatting cannot be sorted lexicographically…

    Below are some TimeUUID generated consecutively:

    8e4cab00-c481-11e1-983b-20cf309ff6dc at some t1
    2b6e3160-c482-11e1-addf-20cf309ff6dc at some t2 with t2 > t1
    

    "2b6e3160-c482-11e1-addf-20cf309ff6dc".compareTo("8e4cab00-c481-11e1-983b-20cf309ff6dc") gives -6, meaning that "2b6e3160-c482-11e1-addf-20cf309ff6dc" is less than (sorts before) "8e4cab00-c481-11e1-983b-20cf309ff6dc", which is incorrect.

    The current textual display of TimeUUID is split as follows:

    time_low – time_mid – time_high_and_version – variant_and_sequence – node
    

    If we re-order it starting with time_high_and_version, we can then sort it lexicographically:

    time_high_and_version – time_mid – time_low – variant_and_sequence – node
    

    The utility class is given below:

    // Requires: import java.util.StringTokenizer;
    public static String reorderTimeUUId(String originalTimeUUID)
    {
        StringTokenizer tokens = new StringTokenizer(originalTimeUUID, "-");
        if (tokens.countTokens() == 5)
        {
            String time_low = tokens.nextToken();
            String time_mid = tokens.nextToken();
            String time_high_and_version = tokens.nextToken();
            String variant_and_sequence = tokens.nextToken();
            String node = tokens.nextToken();

            // Put the most significant timestamp bits first so that
            // lexicographic order follows chronological order.
            return time_high_and_version + '-' + time_mid + '-' + time_low + '-' + variant_and_sequence + '-' + node;
        }

        // Not a 5-group UUID string: return it unchanged.
        return originalTimeUUID;
    }
    

    The TimeUUIDs become:

    11e1-c481-8e4cab00-983b-20cf309ff6dc
    11e1-c482-2b6e3160-addf-20cf309ff6dc
    

    Now we get:

    "11e1-c481-8e4cab00-983b-20cf309ff6dc".compareTo("11e1-c482-2b6e3160-addf-20cf309ff6dc") = -1
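
    Since the question is about Python, the same reordering can be sketched there as well. The function name reorder_time_uuid is made up for this example; the two sample values are the TimeUUIDs quoted above.

```python
def reorder_time_uuid(text: str) -> str:
    """Move the high timestamp bits to the front of a UUID string."""
    parts = text.split("-")
    if len(parts) != 5:
        return text  # not in the canonical 5-group form; leave untouched
    time_low, time_mid, time_hi_version, variant_seq, node = parts
    return "-".join([time_hi_version, time_mid, time_low, variant_seq, node])


if __name__ == "__main__":
    t1 = "8e4cab00-c481-11e1-983b-20cf309ff6dc"
    t2 = "2b6e3160-c482-11e1-addf-20cf309ff6dc"
    print(sorted([t2, t1]))                         # plain string order
    print(sorted([t2, t1], key=reorder_time_uuid))  # order by reordered string
```

    If the values are already uuid.UUID objects, sorting with key=lambda u: u.time is another option that avoids string handling altogether.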
    
  • 2020-12-19 01:09

    UUIDs Not Sequential

    No, standard UUIDs are not meant to be sequential.

    Apparently some attempts were made with GUIDs (Microsoft's twist on UUIDs) to make them sequential to help with performance in certain database scenarios. But being sequential is not the intent of UUIDs. http://en.wikipedia.org/wiki/Globally_unique_identifier

    MAC Is Last, Not First

    No, in standard UUIDs, the MAC address is not the first component. The MAC address is the last component in a Version 1 UUID. http://en.wikipedia.org/wiki/Universally_unique_identifier
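
    The field order can be checked from Python with the attributes of uuid.UUID. A short sketch, using a sample version 1 value chosen only for illustration:

```python
import uuid

u = uuid.UUID("8e4cab00-c481-11e1-983b-20cf309ff6dc")

print(hex(u.time_low))         # first group of the string form
print(hex(u.time_mid))         # second group
print(hex(u.time_hi_version))  # third group; its top 4 bits hold the version
print(hex(u.clock_seq))        # 14-bit clock sequence (variant bits stripped)
print(hex(u.node))             # last group: the 48-bit node, usually a MAC address
```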

    Do Not Assume Which Type Of UUID

    The various versions of UUIDs are meant to be compatible with each other. So it may be unreasonable to expect that you always have Version 1 UUIDs. Other programmers may use other versions.
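
    In Python this means checking the version attribute before relying on the timestamp. A minimal sketch (the helper name timestamp_or_none is hypothetical):

```python
import uuid


def timestamp_or_none(u: uuid.UUID):
    """Return the 60-bit timestamp only when the UUID really is version 1."""
    return u.time if u.version == 1 else None


if __name__ == "__main__":
    print(timestamp_or_none(uuid.uuid1()))  # an integer count of 100 ns ticks
    print(timestamp_or_none(uuid.uuid4()))  # None: random UUIDs carry no time
```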

    Specification

    Read the UUID specification, RFC 4122, published by the IETF. It is a fairly short document.

  • 2020-12-19 01:11

    But not always:

    >>> import uuid
    >>> def test(n):
    ...     old = uuid.uuid1()
    ...     print(old)
    ...     for x in range(n):
    ...         new = uuid.uuid1()
    ...         if old >= new:
    ...             print("OOops")
    ...             break
    ...         old = new
    ...     print(new)
    ...
    >>> test(1000000)
    fd4ae687-3619-11e1-8801-c82a1450e52f
    OOops
    00000035-361a-11e1-bc9f-c82a1450e52f
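
    A likely reading of this output: UUID objects compare by their 128-bit integer value, whose most significant bits are time_low, so the comparison flips when time_low wraps around even though the embedded timestamp keeps growing. The sketch below checks this on the first and last values printed by test() above:

```python
import uuid

old = uuid.UUID("fd4ae687-3619-11e1-8801-c82a1450e52f")
new = uuid.UUID("00000035-361a-11e1-bc9f-c82a1450e52f")

print(old >= new)           # the comparison used in test()
print(old.time < new.time)  # comparison of the embedded 60-bit timestamps
```

    Comparing the time attribute instead of the UUID objects themselves avoids the wraparound of the first group.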
    
  • 2020-12-19 01:16

    From the Python uuid docs:

    Generate a UUID from a host ID, sequence number, and the current time. If node is not given, getnode() is used to obtain the hardware address. If clock_seq is given, it is used as the sequence number; otherwise a random 14-bit sequence number is chosen.

    From this, I infer that the MAC address is first, then a (possibly random) sequence number, then the current time. So I would not expect these to be guaranteed to be monotonically increasing, even for UUIDs generated by the same machine/process.
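
    The actual position of each ingredient can be checked by passing recognizable values for node and clock_seq and looking at where they end up. A small sketch; the two constants are arbitrary values picked for this example:

```python
import uuid

u = uuid.uuid1(node=0x123456789abc, clock_seq=0x1234)

print(u)                 # string form: time groups, then clock_seq, then node
print(hex(u.clock_seq))  # the clock_seq that was passed in
print(hex(u.node))       # the node that was passed in
```

    The fourth group shows the clock sequence combined with the variant bits, which is why it does not read as 1234 in the printed string.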
