Anonymizing data / replacing names

一整个雨季

Normally I anonymize my data by using hashlib and using the .apply(hash) function.

Now I'm trying a new approach. Imagine I have the following df called 'data':

3 Answers

旧巷少年郎

2021-01-24 06:39

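```
import pandas as pd

# Encode each distinct name as an integer code (0-based, in order of first appearance)
labels, uniques = pd.factorize(df['name'])
labels = ['person_' + str(l) for l in labels]
df['contributor_anonymized'] = labels
```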
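To see what this produces, here is a minimal sketch on made-up data. The sample frame, the names in it, and the `amount` column are assumptions for illustration only; they are not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Bob", "Dave"],
    "amount": [10, 28, 49, 77, 31],
})

# factorize returns one integer code per row (numbered in order of
# first appearance) plus the unique values those codes refer to.
codes, uniques = pd.factorize(df["name"])

# Turn the codes into labels; repeated names get the same label.
df["contributor_anonymized"] = ["person_" + str(c) for c in codes]

# Drop the real names once the anonymized column is in place.
anonymized = df.drop(columns="name")
```

Note that `pd.factorize` assigns the code -1 to missing values by default, so rows without a name would all end up as `person_-1` unless they are handled separately.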


忘了有多久

2021-01-24 06:43
Maybe try creating a data frame called "index" for this operation and keep the unique name values in it?

Then build a mask for each unique name from its row position in that frame, and merge the resulting data frame index with data.
```
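import pandas as pd

# Lookup table: one row per unique name
index = pd.DataFrame()
index['name'] = data['name'].unique()
# Label each name by its row position in the lookup table (1-based)
index['mask'] = index['name'].apply(
    lambda x: 'person' + str(index[index.name == x].index[0] + 1))

# Attach the labels to the original rows and keep only the anonymized columns
data.merge(index, how='left')[['mask', 'amount']]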
```
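The same lookup idea can also be sketched with a plain dict and `Series.map`, which avoids re-filtering the lookup frame for every name. This is only a sketch: the sample frame and the `name` and `amount` columns are assumptions, and `mapping` is a name made up here.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
data = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Bob", "Dave"],
    "amount": [10, 28, 49, 77, 31],
})

# Build the lookup once: each unique name gets a 1-based label,
# numbered in order of first appearance.
mapping = {
    name: f"person{i}"
    for i, name in enumerate(data["name"].unique(), start=1)
}

# Map every row through the lookup and keep only the anonymized columns.
result = data.assign(mask=data["name"].map(mapping))[["mask", "amount"]]
```

Keeping `mapping` around (and storing it somewhere safe) is what makes the labels reversible later; discard it if the anonymization should be one-way.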

长情又很酷

2021-01-24 06:47

I think a faster solution is to use factorize to encode the unique values, add 1, convert the result to a Series of strings, and prepend the string Person:

```
df['contributor'] = 'Person' + pd.Series(pd.factorize(df['contributor'])[0] + 1).astype(str)
print(df)
  contributor  amount payed
0     Person1            10
1     Person2            28
2     Person3            49
3     Person2            77
4     Person4            31
```
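One thing to watch with this one-liner: `pd.Series(...)` gets a default 0..n-1 index, and column assignment aligns on the index, so on a frame whose index is not 0..n-1 (for example after filtering) some rows can end up as NaN. A sketch of an index-safe variant follows; the sample frame and its index values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with a deliberately non-default index.
df = pd.DataFrame(
    {
        "contributor": ["Alice", "Bob", "Carol", "Bob", "Dave"],
        "amount payed": [10, 28, 49, 77, 31],
    },
    index=[10, 11, 12, 13, 14],
)

# 1-based integer code per row, numbered in order of first appearance.
codes = pd.factorize(df["contributor"])[0] + 1

# Reuse df's own index so the assignment lines up row by row.
labels = "Person" + pd.Series(codes, index=df.index).astype(str)
df["contributor"] = labels
```

On a frame with the default index the two versions give the same result.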
