Python docs states that uuid1 uses current time to form the uuid value. But I could not find a reference that ensures UUID1 is sequential.
>>> impor
Argumentless use of uuid.uuid1()
gives non-sequential results (see answer by @basil-bourque), but it can be easily made sequential if you set clock_seq
or node
arguments (because in this case uuid1
uses python implementation that guarantees to have unique and sequential timestamp
part of the UUID in current process):
import time
from uuid import uuid1, getnode
from random import getrandbits
_my_clock_seq = getrandbits(14)
_my_node = getnode()
def sequential_uuid(node=None):
return uuid1(node=node, clock_seq=_my_clock_seq)
def alt_sequential_uuid(clock_seq=None):
return uuid1(node=_my_node, clock_seq=clock_seq)
if __name__ == '__main__':
from itertools import count
old_n = uuid1() # "Native"
old_s = sequential_uuid() # Sequential
native_conflict_index = None
t_0 = time.time()
for x in count():
new_n = uuid1()
new_s = sequential_uuid()
if old_n > new_n and not native_conflict_index:
native_conflict_index = x
if old_s >= new_s:
print("OOops: non-sequential results for `sequential_uuid()`")
break
if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
print('No issues for `sequential_uuid()`')
break
old_n = new_n
old_s = new_s
print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
print(f"Tries: {x}")
BUT if you are running some parallel processes on the same machine, then:
node
which defaults to uuid.get_node()
will be the same for all the processes;clock_seq
has small chance to be the same for some processes (chance of 1/16384)That might lead to conflicts! That is general concern for using
uuid.uuid1
in parallel processes on the same machine unless you have access to SafeUUID from Python3.7.
If you make sure to also set node
to unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID, and set unique node
, it's still possible to have non-sequential ids if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq
in some external atomic storage (for example in "locked" file) and increment it with each call: this allows to have same value for node
on all parallel processes and also will make id-s sequential. For cases when all parallel processes are subprocesses created using multiprocessing
: clock_seq
can be "shared" using multiprocessing.Value
I stumbled upon a probable answer in Cassandra/Python from http://doanduyhai.wordpress.com/2012/07/05/apache-cassandra-tricks-and-traps/
Cassandra provides, among all the primitive types, support for UUID values of type 1 (time and server based) and type 4 (random).
The primary use of UUID (Unique Universal IDentifier) is to obtain a really unique identifier in a potentially distributed environment.
Cassandra does support version 1 UUID. It gives you an unique identifier by combining the computer’s MAC address and the number of 100-nanosecond intervals since the beginning of the Gregorian calendar.
As you can see the precision is only 100 nanoseconds, but fortunately it is mixed with a clock sequence to add randomness. Furthermore the MAC address is also used to compute the UUID so it’s very unlikely that you face collision on one cluster of machine, unless you need to process a really really huge volume of data (don’t forget, not everyone is Twitter or Facebook).
One of the most relevant use case for UUID, and espcecially TimeUUID, is to use it as column key. Since Cassandra column keys are sorted, we can take advantage of this feature to have a natural ordering for our column families.
The problem with the default com.eaio.uuid.UUID provided by the Hector client is that it’s not easy to work with. As an ID you may need to bring this value from the server up to the view layer, and that’s the gotcha.
Basically, com.eaio.uuid.UUID overrides the toString() to gives a String representation of the UUID. However this String formatting cannot be sorted lexicographically…
Below are some TimeUUID generated consecutively:
8e4cab00-c481-11e1-983b-20cf309ff6dc at some t1
2b6e3160-c482-11e1-addf-20cf309ff6dc at some t2 with t2 > t1
“2b6e3160-c482-11e1-addf-20cf309ff6dc”.compareTo(“8e4cab00-c481-11e1-983b-20cf309ff6dc”)
gives -6 meaning that “2b6e3160-c482-11e1-addf-20cf309ff6dc” is less/before “8e4cab00-c481-11e1-983b-20cf309ff6dc” which is incorrect.
The current textual display of TimeUUID is split as follow:
time_low – time_mid – time_high_and_version – variant_and_sequence – node
If we re-order it starting with time_high_and_version, we can then sort it lexicographically:
time_high_and_version – time_mid – time_low – variant_and_sequence – node
The utility class is given below:
public static String reorderTimeUUId(String originalTimeUUID)
{
StringTokenizer tokens = new StringTokenizer(originalTimeUUID, "-");
if (tokens.countTokens() == 5)
{
String time_low = tokens.nextToken();
String time_mid = tokens.nextToken();
String time_high_and_version = tokens.nextToken();
String variant_and_sequence = tokens.nextToken();
String node = tokens.nextToken();
return time_high_and_version + '-' + time_mid + '-' + time_low + '-' + variant_and_sequence + '-' + node;
}
return originalTimeUUID;
}
The TimeUUIDs become:
11e1-c481-8e4cab00-983b-20cf309ff6dc
11e1-c482-2b6e3160-addf-20cf309ff6dc
Now we get:
"11e1-c481-8e4cab00-983b-20cf309ff6dc".compareTo("11e1-c482-2b6e3160-addf-20cf309ff6dc") = -1
No, standard UUIDs are not meant to be sequential.
Apparently some attempts were made with GUIDs (Microsoft's twist on UUIDs) to make them sequential to help with performance in certain database scenarios. But being sequential is not the intent of UUIDs. http://en.wikipedia.org/wiki/Globally_unique_identifier
No, in standard UUIDs, the MAC address is not the first component. The MAC address is the last component in a Version 1 UUID. http://en.wikipedia.org/wiki/Universally_unique_identifier
The various versions of UUIDs are meant to be compatible with each other. So it may be unreasonable to expect that you always have Version 1 UUIDs. Other programmers may use other versions.
Read the UUID spec, RFC 4122, by the IETF. Only a dozen pages long.
But not always:
>>> def test(n):
... old = uuid.uuid1()
... print old
... for x in range(n):
... new = uuid.uuid1()
... if old >= new:
... print "OOops"
... break
... old = new
... print new
>>> test(1000000)
fd4ae687-3619-11e1-8801-c82a1450e52f
OOops
00000035-361a-11e1-bc9f-c82a1450e52f
From the python UUID docs:
Generate a UUID from a host ID, sequence number, and the current time. If node is not given, getnode() is used to obtain the hardware address. If clock_seq is given, it is used as the sequence number; otherwise a random 14-bit sequence number is chosen.
From this, I infer that the MAC address is first, then a (possibly random) sequence number, then the current time. So I would not expect these to be guaranteed to be monotonically increasing, even for UUIDs generated by the same machine/process.