I have a list of strings. I want to assign a unique number to each string (the exact number is not important), and create a list of the same length using these numbers, in o
Without using an external library (see the EDIT for a Pandas solution), you can do it as follows:
d = {ni: indi for indi, ni in enumerate(set(names))}
numbers = [d[ni] for ni in names]
Brief explanation:
In the first line, you assign a number to each unique element in your list and store the mapping in the dictionary d. It is easy to build with a dictionary comprehension; set returns the unique elements of names.
Then, in the second line, a list comprehension looks up each name in d and stores the resulting numbers in the list numbers.
One example to illustrate that it also works fine for unsorted lists:
# 'll' appears all over the place
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']
This is the output for numbers (the exact values can differ between runs, because the iteration order of a set is arbitrary):
[1, 1, 3, 3, 3, 2, 2, 1, 2, 0, 0, 0, 1]
As you can see, the number 1 associated with ll appears at the correct places.
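If reproducible numbering matters, one option (my assumption, not part of the approach above) is to sort the unique names before enumerating them, so the result no longer depends on how the set happens to be ordered. A minimal sketch:

```python
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

# Sorting the unique names fixes the numbering regardless of set ordering.
d = {name: idx for idx, name in enumerate(sorted(set(names)))}
numbers = [d[name] for name in names]

print(d)        # {'HL': 0, 'LL': 1, 'hl': 2, 'll': 3}
print(numbers)  # [3, 3, 2, 2, 2, 1, 1, 3, 1, 0, 0, 0, 3]
```

The sort costs a little extra for many distinct names, but the mapping is then the same on every run.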
EDIT
If you have Pandas available, you can also use pandas.factorize (which seems to be quite efficient for huge lists and also works fine for lists of tuples as explained here):
import pandas as pd
pd.factorize(names)
will then return
(array([0, 0, 1, 1, 1, 2, 2, 0, 2, 3, 3, 3, 0]),
 array(['ll', 'hl', 'LL', 'HL'], dtype=object))
Therefore,
numbers = pd.factorize(names)[0]
I managed to modify your script very slightly and it looks ok:
names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']
names.sort()
print(names)
numbers = []
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
        break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        numbers.append(num)
        num = num + 1
numbers.append(num)
print(numbers)
You can see it is very similar; the only difference is that instead of adding the number for the NEXT element, I add the number for the CURRENT element. That's all. Oh, and the sorting: names.sort() reorders the list in place, so numbers lines up with the sorted list rather than with the original order. It sorts capitals first, then lowercase in this example; you can play with sort(key=lambda x: ...) if you wish to change that. (Perhaps like this: names.sort(key=lambda x: (x.upper() if x.lower() == x else x.lower())).)
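To see what that sort key does, here is a small sketch comparing the default order with the keyed order. The helper name swap_case_key is hypothetical; its body is the same expression as the lambda above.

```python
def swap_case_key(x):
    # Hypothetical helper, same expression as the lambda:
    # all-lowercase strings compare as uppercase, everything else as lowercase.
    return x.upper() if x.lower() == x else x.lower()

names = ['ll', 'hl', 'LL', 'HL', 'zzz']

default_order = sorted(names)                   # capitals sort first
lower_first = sorted(names, key=swap_case_key)  # lowercase sorts first

print(default_order)  # ['HL', 'LL', 'hl', 'll', 'zzz']
print(lower_first)    # ['hl', 'll', 'zzz', 'HL', 'LL']
```

Mixed-case strings such as 'Ab' are simply compared by their lowercase form with this key.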
If the condition is that the numbers are unique and the exact number is not important, then you can build a mapping relating each item in the list to a unique number on the fly, assigning values from a count object:
from itertools import count
names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
d = {}
c = count()
numbers = [d.setdefault(i, next(c)) for i in names]
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]
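The gaps in the output (0, 2, 4, 7) come from next(c) being evaluated on every iteration, even when setdefault finds the key and discards the new value. A short sketch that makes this visible by inspecting the mapping and the counter afterwards (the variable consumed is only for illustration):

```python
from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
d = {}
c = count()
numbers = [d.setdefault(i, next(c)) for i in names]

# Each name keeps the counter value from its first appearance.
print(d)  # {'ll': 0, 'hl': 2, 'LL': 4, 'HL': 7}

# One counter value was used up per element of names, not per unique name.
consumed = next(c)
print(consumed)  # 9
```

So the numbers are unique but not consecutive, which is fine when the exact values do not matter.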
You could do away with the extra names by using map on the list and a count object, and setting the map function as {}.setdefault (see @StefanPochmann's comment):
from itertools import count
names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
numbers = list(map({}.setdefault, names, count()))  # list() is needed on Py3, where map returns an iterator
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]
As an extra, you could also use np.unique, in case you already have numpy installed:
import numpy as np
_, numbers = np.unique(names, return_inverse=True)
print(numbers)
# [3 3 2 2 1 1 1 0 3]
If you have k different values, this maps them to integers 0 to k-1 in order of first appearance:
>>> names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']
>>> tmp = {}
>>> [tmp.setdefault(name, len(tmp)) for name in names]
[0, 1, 2, 1, 0, 3, 0]
To make it more generic you can wrap it in a function, so these hard-coded values don't do any harm, because they are local.
If you use efficient lookup containers (I'll use a plain dictionary), you can keep the first index of each string without losing too much performance:
def your_function(list_of_strings):
    encountered_strings = {}
    result = []
    idx = 0
    for astring in list_of_strings:
        if astring in encountered_strings:  # check if you have already seen this string
            result.append(encountered_strings[astring])
        else:
            encountered_strings[astring] = idx
            result.append(idx)
            idx += 1
    return result
And this will assign the indices in order (even if that's not important):
>>> your_function(['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL'])
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
This needs only one iteration over your list of strings, which makes it possible to even process generators and similar.
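As a rough illustration of the single-pass property, the sketch below feeds the function a generator instead of a list. The generator read_names is hypothetical and stands in for any one-shot source such as lines read from a file; the function is repeated so the example runs on its own.

```python
def your_function(list_of_strings):
    encountered_strings = {}
    result = []
    idx = 0
    for astring in list_of_strings:
        if astring not in encountered_strings:
            # First time this string is seen: give it the next free index.
            encountered_strings[astring] = idx
            idx += 1
        result.append(encountered_strings[astring])
    return result

def read_names():
    # Hypothetical one-shot source; it can only be iterated once.
    for name in ['ll', 'hl', 'll', 'HL', 'hl']:
        yield name

result = your_function(read_names())
print(result)  # [0, 1, 0, 2, 1]
```

A two-pass approach (build the mapping first, then look everything up) would find the generator already exhausted on its second pass.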
Since you are mapping strings to integers, that suggests using a dict. So you can do the following:
d = dict()
counter = 0
for name in names:
    if name in d:
        continue
    d[name] = counter
    counter += 1
numbers = [d[name] for name in names]