Create hash value for each row of data with selected columns in dataframe in python pandas

北恋 2020-12-06 10:18

I have asked a similar question in R about creating a hash value for each row of data. I know that I can use something like hashlib.md5(b'Hello World').hexdigest()

6 Answers
  • 2020-12-06 10:57
    df.set_index(pd.util.hash_pandas_object(df), drop=False, inplace=True)
    
  • 2020-12-06 10:58

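    The solutions below use Python's built-in hash(), so the values are only guaranteed to be stable for the life of the Python process (string hashes, for example, are randomized per process unless PYTHONHASHSEED is set).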

    If order matters, one method would be to coerce the row (a Series object) to a tuple:

    >>> hash(tuple(df.iloc[1]))  # irow() was removed from pandas; iloc is the replacement
    -4901655572611365671
    

    This demonstrates order matters for tuple hashing:

    >>> hash((1,2,3))
    2528502973977326415
    >>> hash((3,2,1))
    5050909583595644743
    

    To hash every row and append the result as a column:

    >>> df = df.drop(columns='hash')  # lose the old hash
    >>> df['hash'] = [hash(tuple(row)) for _, row in df.iterrows()]
    >>> df
               y  x0                 hash
    0  11.624345  10 -7519341396217622291
    1  10.388244  11 -6224388738743104050
    2  11.471828  12 -4278475798199948732
    3  11.927031  13 -1086800262788974363
    4  14.865408  14  4065918964297112768
    5  12.698461  15  8870116070367064431
    6  17.744812  16 -2001582243795030948
    7  16.238793  17  4683560048732242225
    8  18.319039  18 -4288960467160144170
    9  18.750630  19  7149535252257157079
    
    [10 rows x 3 columns]
    

    If order does not matter, use the hash of frozensets instead of tuples:

    >>> hash(frozenset((3,2,1)))
    -272375401224217160
    >>> hash(frozenset((1,2,3)))
    -272375401224217160
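
One caveat with the frozenset approach: a set discards repeated values, so rows that differ only in how often a value occurs end up with the same hash. The sketch below illustrates this and shows one possible alternative, hashing a sorted tuple, which assumes the values in a row can be compared with each other. The variable names are illustrative only.

```python
# frozenset collapses repeated values, so these two different rows
# reduce to the same set {1, 2} and therefore hash identically.
row_a = (1, 1, 2)
row_b = (1, 2, 2)
same_as_sets = hash(frozenset(row_a)) == hash(frozenset(row_b))

# Sorting keeps the duplicates while still ignoring the original order.
key_a = tuple(sorted(row_a))
key_b = tuple(sorted(row_b))
key_reordered = tuple(sorted((2, 1, 1)))

print(same_as_sets)            # the two rows collide as frozensets
print(key_a == key_b)          # the sorted tuples stay distinct
print(key_a == key_reordered)  # a reordering of row_a gives the same key
```

Calling hash() on the sorted tuple then gives an order-independent value that still distinguishes rows with different duplicate counts.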
    

    Avoid summing the hashes of all of the elements in the row, as this could be cryptographically insecure and lead to hashes that fall outside the range of the original.

    (You could use modulo to constrain the range, but this amounts to rolling your own hash function, and the best practice is not to.)

    You can also make permanent, cryptographic-quality hashes (for example, with SHA-256) using the hashlib module.

    There is some discussion of the API for cryptographic hash functions in PEP 452.
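
As a rough sketch of the hashlib route, the example below computes a SHA-256 digest per row over a chosen subset of columns. The sample data, the `key_columns` list, the `row_sha256` helper and the separator character are all assumptions made for illustration; they do not come from the question.

```python
import hashlib

import pandas as pd

# Hypothetical sample data; only the columns in key_columns are hashed.
df = pd.DataFrame({
    "y": [11.5, 10.25, 11.5],
    "x0": [10, 11, 10],
    "note": ["a", "b", "c"],
})
key_columns = ["y", "x0"]


def row_sha256(row, sep="\x1f"):
    """Return the SHA-256 hex digest of the row's values joined by sep."""
    # The separator keeps ("1", "23") and ("12", "3") from producing
    # the same input string.
    joined = sep.join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# itertuples(index=False, name=None) yields one plain tuple per row.
df["sha256"] = [
    row_sha256(row)
    for row in df[key_columns].itertuples(index=False, name=None)
]
print(df)
```

Unlike the built-in hash(), these digests do not change between Python processes, although they do depend on how each value is rendered by str().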

    Thanks to users Jamie Marshal and Discrete Lizard for their comments.

  • 2020-12-06 11:04

    dfObj['Hash Key'] = (dfObj['DEAL_ID'].map(str) + dfObj['COST_CODE'].map(str) + dfObj['TRADE_ID'].map(str)).apply(hash)

    print(dfObj['Hash Key'])
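
One thing to watch with plain string concatenation is that different rows can produce the same joined string. The sketch below reuses the column names from the snippet above with made-up values, and shows how a separator (the "|" character here is an arbitrary choice) keeps such rows apart.

```python
import pandas as pd

# Made-up values chosen so that the concatenated strings collide.
dfObj = pd.DataFrame({
    "DEAL_ID": [1, 12],
    "COST_CODE": [23, 3],
    "TRADE_ID": [4, 4],
})
cols = ["DEAL_ID", "COST_CODE", "TRADE_ID"]

# Without a separator both rows become the string "1234".
no_sep = dfObj[cols].astype(str).agg("".join, axis=1)

# With a separator the rows become "1|23|4" and "12|3|4".
with_sep = dfObj[cols].astype(str).agg("|".join, axis=1)

dfObj["Hash Key"] = with_sep.apply(hash)
print(no_sep.tolist())
print(with_sep.tolist())
print(dfObj)
```

Because hash() on strings is randomized per process, the "Hash Key" values printed here will differ from run to run unless PYTHONHASHSEED is fixed.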

  • 2020-12-06 11:09

    I came up with this adaptation of the code provided in the question:

    import hashlib
    new_df2 = df.copy()
    key_combination = ['col1', 'col2', 'col3', 'col4']
    # str() lets non-string columns be joined; the SHA-1 hex digest of each joined row becomes the index
    new_df2.index = [hashlib.sha1('-'.join(str(v) for v in row).encode('utf-8')).hexdigest() for row in new_df2[key_combination].values]
    
  • 2020-12-06 11:10

    This is now available in pandas.util.hash_pandas_object:

    pandas.util.hash_pandas_object(df)
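
To hash only selected columns, one option is to pass a column subset to the same function. The following is a minimal sketch with hypothetical data and column names; `index=False` is used so that the row labels do not contribute to the hash.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    "col1": ["a", "b", "a"],
    "col2": [1, 2, 1],
    "col3": [0.5, 1.5, 9.9],
})
key_columns = ["col1", "col2"]

# index=False leaves the row labels out of the hash, so rows with equal
# values in key_columns receive equal hashes.
row_hashes = pd.util.hash_pandas_object(df[key_columns], index=False)

df["hash"] = row_hashes
print(df)
```

The result is a Series of unsigned 64-bit integers aligned with the DataFrame's index, so it can be assigned directly as a new column.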
    
  • 2020-12-06 11:13

    Or simply:

    df.apply(lambda x: hash(tuple(x)), axis = 1)
    

    As an example:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.rand(3,5))
    print(df)
    print(df.apply(lambda x: hash(tuple(x)), axis=1))
    
         0         1         2         3         4
    0  0.728046  0.542013  0.672425  0.374253  0.718211
    1  0.875581  0.512513  0.826147  0.748880  0.835621
    2  0.451142  0.178005  0.002384  0.060760  0.098650
    
    0    5024405147753823273
    1    -798936807792898628
    2   -8745618293760919309
    