What is feature hashing (hashing-trick)?

孤城傲影 2021-02-12 18:03

I know feature hashing (the hashing trick) is used to reduce dimensionality and handle the sparsity of bit vectors, but I don't understand how it really works. Can anyone explain this?

3 Answers
  •  一整个雨季
    2021-02-12 19:00

    In Pandas, you could use something like this:

    import pandas as pd
    
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    
    data = pd.DataFrame(data)
    
    def hash_col(df, col, N):
        # Replace `col` with N indicator columns; the bucket is hash(value) mod N.
        cols = [col + "_" + str(i) for i in range(N)]
        def xform(x):
            tmp = [0] * N
            tmp[hash(x) % N] = 1
            return pd.Series(tmp, index=cols)
        df[cols] = df[col].apply(xform)
        return df.drop(col, axis=1)
    
    print(hash_col(data, 'state', 4))
    

    The output would be

       pop  year  state_0  state_1  state_2  state_3
    0  1.5  2000        0        1        0        0
    1  1.7  2001        0        1        0        0
    2  3.6  2002        0        1        0        0
    3  2.4  2001        0        0        0        1
    4  2.9  2002        0        0        0        1
    
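    One caveat worth flagging: since Python 3.3, the built-in `hash()` for strings is salted per interpreter process, so the exact bucket assignments shown above will change between runs. A minimal sketch of a deterministic alternative using the standard library's `hashlib` (the helper name `stable_hash` is my own, not part of any library):

```python
import hashlib

def stable_hash(x, N):
    # MD5 of the value's string form, reduced modulo N.
    # Unlike the built-in hash(), this is stable across processes and runs.
    digest = hashlib.md5(str(x).encode("utf-8")).hexdigest()
    return int(digest, 16) % N

# The same input always lands in the same bucket, in any process.
print(stable_hash("Ohio", 4), stable_hash("Nevada", 4))
```

    Swapping `stable_hash(x, N)` in for `hash(x) % N` inside `hash_col` makes the encoding reproducible.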

    You can also do this at the Series level:

    import numpy as np
    import pandas as pd

    def hash_col(df, col, N):
        df = df.replace('', np.nan)
        cols = [col + "_" + str(i) for i in range(N)]
        tmp = [0] * N
        tmp[hash(df[col]) % N] = 1
        # Series.append was removed in pandas 2.0; use pd.concat instead.
        res = pd.concat([df, pd.Series(tmp, index=cols)])
        return res.drop(col)
    
    a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
    b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])
    
    print(hash_col(a, 'city', 10))
    print(hash_col(b, 'city', 10))
    

    This works on a single Series at a time; the column name is assumed to be a label in the Series' index. It also replaces blank strings with NaN, and the result has object dtype.

    age        30
    test      NaN
    city_0      0
    city_1      0
    city_2      0
    city_3      0
    city_4      0
    city_5      0
    city_6      0
    city_7      1
    city_8      0
    city_9      0
    dtype: object
    age        30
    test      NaN
    city_0      0
    city_1      0
    city_2      0
    city_3      0
    city_4      0
    city_5      1
    city_6      0
    city_7      0
    city_8      0
    city_9      0
    dtype: object
    
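    The examples above map each value to a single bucket, so collisions simply merge features. The usual hashing-trick refinement adds a second, sign hash so that colliding features tend to cancel rather than pile up. A pure-Python sketch of the idea (the helper name `hashed_vector` is mine; in practice one would typically reach for scikit-learn's `FeatureHasher`):

```python
import hashlib

def hashed_vector(tokens, N):
    # Signed hashing trick: each token adds +1 or -1 to one of N buckets,
    # so colliding tokens cancel in expectation instead of accumulating.
    vec = [0] * N
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % N                              # which bucket
        sign = 1 if (h >> 64) % 2 == 0 else -1   # separate bits drive the sign
        vec[idx] += sign
    return vec

print(hashed_vector(["city=boston", "age=30"], 8))
```

    Encoding each token as `"column=value"` before hashing, as above, keeps distinct columns from clashing trivially.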

    If, however, there is a vocabulary, and you simply want to one-hot-encode, you could use

    import pandas as pd
    import scipy.sparse as sps
    
    def hash_col(df, col, vocab):
        # One-hot encode `col` against a fixed vocabulary.
        cols = [col + "=" + str(v) for v in vocab]
        def xform(x):
            tmp = [0] * len(vocab)
            tmp[vocab.index(x)] = 1
            return pd.Series(tmp, index=cols)
        df[cols] = df[col].apply(xform)
        return df.drop(col, axis=1)
    
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    
    df = pd.DataFrame(data)
    
    df2 = hash_col(df, 'state', ['Ohio', 'Nevada'])
    print(df2)
    
    # sparsify the final dataframe
    sparse = sps.csr_matrix(df2)
    

    which will give

       pop  year  state=Ohio  state=Nevada
    0  1.5  2000           1             0
    1  1.7  2001           1             0
    2  3.6  2002           1             0
    3  2.4  2001           0             1
    4  2.9  2002           0             1
    

    I also added sparsification of the final dataframe. In an incremental setting, where we may not have encountered every value yet but do have the list of all possible values, the approach above still applies: incremental ML methods need the same number of features at every increment, so one-hot encoding must produce the same number of columns for each batch.
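    That fixed-width guarantee is easy to see without Pandas at all. A plain-Python sketch of the same idea (the `one_hot` helper is hypothetical, for illustration only):

```python
def one_hot(value, vocab):
    # Output width always equals len(vocab), regardless of which
    # values actually appear in a given batch.
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec

vocab = ['Ohio', 'Nevada']
batch1 = [one_hot(v, vocab) for v in ['Ohio', 'Ohio', 'Ohio']]
batch2 = [one_hot(v, vocab) for v in ['Nevada']]
# Both batches have rows of width 2: [1, 0] for Ohio, [0, 1] for Nevada.
```

    Because the vocabulary, not the batch contents, fixes the width, every batch is compatible with the same downstream model.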
