What is feature hashing (hashing-trick)?

孤城傲影 2021-02-12 18:03

I know feature hashing (hashing-trick) is used to reduce the dimensionality and handle sparsity of bit vectors but I don't understand how it really works. Can anyone explain th

3 answers
  •  南笙 (original poster)
    2021-02-12 18:55

    Large sparse features can be derived from interactions. With U as users and X as emails, the dimension of U x X is memory intensive. Tasks like spam filtering usually have time constraints as well.

    The hashing trick, like other uses of hash functions, maps each feature to an index, which makes large-scale training feasible. In theory, a longer hashed vector (more buckets) gives fewer collisions and better performance, as illustrated in the original paper.
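The link between hashed length and collisions can be sketched in a few lines. This is only an illustration under assumptions: the vocabulary is synthetic, the helper names `bucket` and `count_collisions` are made up for this example, and MD5 stands in for whatever hash function a real system would use (Python's built-in `hash()` is salted per process for strings, so it is avoided here).

```python
import hashlib

def bucket(token: str, n_buckets: int) -> int:
    """Map a token to a bucket index with a deterministic hash (MD5, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def count_collisions(tokens, n_buckets: int) -> int:
    """Count tokens that land in a bucket already taken by another token."""
    used = {bucket(t, n_buckets) for t in tokens}
    return len(tokens) - len(used)

# Hypothetical vocabulary of 1000 distinct tokens.
vocab = [f"word{i}" for i in range(1000)]

for bits in (4, 10, 20):
    n = 2 ** bits
    print(f"{n:>8} buckets -> {count_collisions(vocab, n)} collisions")
```

With only 16 buckets almost every token collides, while with 2^20 buckets collisions become rare, which is the trade-off between memory and distortion mentioned above.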

    It assigns the original features to buckets in a feature space of finite length so that their semantics are largely preserved, even when spammers use typos to slip under the radar. Although hashing introduces some distortion error, the hashed forms remain close.

    For example,

    "the quick brown fox" is transformed to:

    h(the) mod 5 = 0
    h(quick) mod 5 = 1
    h(brown) mod 5 = 1
    h(fox) mod 5 = 3

    Using the index rather than the text value saves space.
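The example above can be turned into a small runnable sketch. The function names `h` and `hash_features` are hypothetical, and MD5 is an assumed stand-in for the hash function, so the actual bucket indices will generally differ from the illustrative values 0, 1, 1, 3 shown above.

```python
import hashlib
from typing import List

def h(token: str) -> int:
    """Deterministic integer hash of a token (MD5 is an assumption here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash_features(text: str, n_buckets: int = 5) -> List[int]:
    """Build a fixed-length count vector: each token increments bucket h(token) mod n_buckets."""
    vec = [0] * n_buckets
    for token in text.lower().split():
        vec[h(token) % n_buckets] += 1
    return vec

vec = hash_features("the quick brown fox")
print(vec)  # a length-5 count vector whose entries sum to 4
```

No vocabulary dictionary is stored: the vector length is fixed in advance, and tokens that share a bucket (like "quick" and "brown" in the example) simply add up in the same position.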

    To summarize some of the applications:

    • dimensionality reduction for high-dimensional feature vectors

      • text in email classification tasks, collaborative filtering on spam

    • sparsification

    • bag-of-words on the fly

    • cross-product features

    • multi-task learning
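The cross-product and multi-task items can be sketched together: instead of materializing the full U x X space, each token is hashed once as a shared feature and once combined with the user id. This is a minimal sketch in the spirit of personalized spam filtering, not the paper's exact implementation. The names `h` and `personalized_features`, the `user_token` key format, the bucket count, and the use of MD5 are all assumptions made for illustration.

```python
import hashlib
from typing import List

def h(key: str, n_buckets: int) -> int:
    """Deterministic bucket index for a string key (MD5 is an assumption here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def personalized_features(user: str, text: str, n_buckets: int = 2 ** 18) -> List[int]:
    """Return active bucket indices: one shared and one user-specific index per token."""
    indices = []
    for token in text.lower().split():
        indices.append(h(token, n_buckets))              # shared (global) feature
        indices.append(h(f"{user}_{token}", n_buckets))  # user x token cross feature
    return indices

idx = personalized_features("user42", "cheap pills now")
print(idx)
```

All users share one weight vector of length `n_buckets`, so memory stays bounded no matter how many users or tokens appear.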

    References:

    • Original papers:

      1. Feature Hashing for Large Scale Multitask Learning

      2. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels

    • What is the hashing trick

    • Quora

    • Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing

    Implementation:

    • Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project (Technical Report). http://hunch.net/?p=309.
