Best way to change words into numbers using specific word list

广开言路 2021-01-14 20:58

I have a text file with one tweet per line that needs to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) to achieve

3 answers
  • 2021-01-14 21:38
    from string import punctuation as pnc
    tokens = {':)', 'cool', 'happy', 'fun'}
    tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
    for tweet in tweets:
        s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
        print(' '.join('1' if t else '0' for t in s))
    

    Output:

    0 0 0 0 1 0 1
    0 0 0 1 0 0 0 1
    

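    The or in the list comprehension is there to handle :), as suggested by @EOL: each word is first checked as-is, and only then with its surrounding punctuation stripped.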

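    There are still cases that will not be handled correctly, such as cool :), I like it. There the word :), matches neither as-is nor after stripping, because stripping punctuation from it leaves an empty string. The problem is inherent to the requirements.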
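
One possible way to cover the trailing-punctuation case is to peel punctuation off the end of a word one character at a time and test each intermediate form. This is only a sketch: the helper names matches and encode are made up for illustration, and other ambiguous inputs would still need their own rules.

```python
from string import punctuation

TOKENS = {':)', 'cool', 'happy', 'fun'}

def matches(word, tokens=TOKENS):
    """Hypothetical helper: True if word, or word minus trailing punctuation, is a token."""
    candidate = word
    while candidate:
        if candidate in tokens:
            return True
        if candidate[-1] not in punctuation:
            break
        candidate = candidate[:-1]  # drop one trailing punctuation character and retry
    # fall back to stripping punctuation from both ends, as in the answer above
    return word.strip(punctuation) in tokens

def encode(tweet, tokens=TOKENS):
    """Hypothetical helper: map each whitespace-separated word to '1' or '0'."""
    return ' '.join('1' if matches(w, tokens) else '0' for w in tweet.split())

print(encode('cool :), I like it'))  # 1 1 0 0 0
```

The two tweets from the example above still produce the same output with this variant.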

  • 2021-01-14 21:38

    If you need an all-regex solution, have a look at my answer here: Changing lines of text into binary type pattern
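
For readers who cannot follow the link, here is a rough idea of what a regex-only variant might look like in Python. This is not the linked solution, just a sketch under the assumption that a word counts as a match when it is a token optionally wrapped in punctuation; the names WORD_RE and encode are illustrative.

```python
import re

TOKENS = [':)', 'cool', 'happy', 'fun']

# Escape each token so that characters like ')' are matched literally.
alternation = '|'.join(re.escape(t) for t in sorted(TOKENS, key=len, reverse=True))

# A word matches if it is a token with optional punctuation on either side.
# [^\w\s] means "neither a word character nor whitespace" (so '_' is not stripped).
WORD_RE = re.compile(r'[^\w\s]*(?:' + alternation + r')[^\w\s]*')

def encode(tweet):
    """Replace every whitespace-delimited word with '1' or '0'."""
    return re.sub(r'\S+', lambda m: '1' if WORD_RE.fullmatch(m.group()) else '0', tweet)

print(encode('this has been a fun day :)'))
print(encode('i find python cool! it makes me happy'))
```

Because re.sub only rewrites the non-whitespace runs, the original spacing between words is preserved in the output.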

  • 2021-01-14 21:46

    In awk:

    awk '
    NR==FNR {
        a[$1]
        next
    }

    {
        gsub(/!/, "", $0)  # This will ignore `!`. Other rules can be added.
        for (i=1; i<=NF; i++) {
            if ($i in a) {
                printf "1 "
            }
            else {
                printf "0 "
            }
        }
        print ""
    }' lookup tweets
    

    Test (you'll probably need to alter the gsub line to handle special cases):

    [jaypal:~/Temp] cat lookup
    :)
    cool
    happy
    fun
    
    [jaypal:~/Temp] cat tweets
    this has been a fun day :)
    i find python cool! it makes me happy
    
    [jaypal:~/Temp] awk '
    NR==FNR {
        a[$1]
        next
    }

    {
        gsub(/!/, "", $0)
        for (i=1; i<=NF; i++) {
            if ($i in a) {
                printf "1 "
            }
            else {
                printf "0 "
            }
        }
        print ""
    }' lookup tweets
    0 0 0 0 1 0 1
    0 0 0 1 0 0 0 1
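
For comparison, the same two-pass idea (load the lookup list first, then scan the tweets) could be sketched in Python roughly as follows. The function names and the ignore parameter are assumptions made for this example; unlike the awk version, it skips blank lookup lines and does not print a trailing space.

```python
def load_lookup(lines):
    """Collect the first whitespace-separated field of each non-blank line,
    roughly what a[$1] does in the awk script."""
    return {line.split()[0] for line in lines if line.strip()}

def encode_lines(lookup_lines, tweet_lines, ignore='!'):
    """Yield one '1'/'0' string per tweet line.

    ignore lists the characters to delete before splitting, playing the
    role of the gsub call in the awk script.
    """
    tokens = load_lookup(lookup_lines)
    drop = str.maketrans('', '', ignore)  # translation table that deletes those characters
    for line in tweet_lines:
        words = line.translate(drop).split()
        yield ' '.join('1' if w in tokens else '0' for w in words)

lookup = [':)', 'cool', 'happy', 'fun']
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
for row in encode_lines(lookup, tweets):
    print(row)
```

Both arguments can be any iterables of lines, so open file objects for lookup and tweets could be passed in place of the lists.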
    