Best way to change words into numbers using specific word list

I have a text file containing one tweet per line, which needs to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) to achieve

3 Answers

失恋的感觉

2021-01-14 21:38

from string import punctuation as pnc
tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
for tweet in tweets:
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))

Output:

0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

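The `or` in the list comprehension is there to handle `:)`, as suggested by @EOL: the raw word is checked first, because stripping punctuation from `:)` would leave an empty string.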

Some cases are still not handled correctly, such as `cool :), I like it`: the word `:),` is not in the token list as written, and stripping its punctuation leaves an empty string. The problem is inherent to the requirements.
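One possible way to cover the `cool :), I like it` case is to peel trailing punctuation off one character at a time and retry the lookup after each step. This is only a sketch, not part of the answer above; the helper names `matches` and `encode` are made up for illustration, and ambiguous inputs will still exist.

```python
from string import punctuation

TOKENS = {':)', 'cool', 'happy', 'fun'}


def matches(word, tokens=TOKENS):
    """Return True if the word, or a punctuation-trimmed form of it, is a token."""
    # Same two checks as the original answer.
    if word in tokens or word.strip(punctuation) in tokens:
        return True
    # Hypothetical extra step: drop trailing punctuation one character
    # at a time, so ':),' is also tried as ':)'.
    while word and word[-1] in punctuation:
        word = word[:-1]
        if word in tokens:
            return True
    return False


def encode(tweet, tokens=TOKENS):
    """Encode a tweet as space-separated 1/0 flags, one per word."""
    return ' '.join('1' if matches(w, tokens) else '0' for w in tweet.split())


print(encode('cool :), I like it'))  # prints: 1 1 0 0 0
```

Leading punctuation glued to an emoticon, or an emoticon embedded in the middle of a word, would still need rules of their own.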

隐瞒了意图╮

2021-01-14 21:38

If you need an all-regex approach, have a look at my solution in "Changing lines of text into binary type pattern".

暗喜

2021-01-14 21:46

In awk:

awk '
NR==FNR {
    a[$1]
    next
}

{
    gsub(/!/, "", $0)  # This will ignore `!`. Other rules can be added.
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets

Test: (You'll probably need to alter the `gsub` line to handle special cases.)

[jaypal:~/Temp] cat lookup
:)
cool
happy
fun

[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy

[jaypal:~/Temp] awk '
NR==FNR {
    a[$1]
    next
}

{
    gsub(/!/, "", $0)
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
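
For comparison, the same lookup-file idea could be sketched in Python roughly as follows. The function names `load_tokens` and `encode_lines` are hypothetical, and the in-memory lists stand in for the `lookup` and `tweets` files so the snippet runs on its own.

```python
def load_tokens(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def encode_lines(lines, tokens):
    """Yield one string of space-separated 1/0 flags per input line."""
    for line in lines:
        # Same special case as the gsub call: drop '!' before splitting.
        words = line.replace('!', '').split()
        yield ' '.join('1' if word in tokens else '0' for word in words)


# In-memory stand-ins for the two files. With real files, pass
# open('lookup') and open('tweets') instead, since file objects
# also iterate line by line.
lookup = [':)\n', 'cool\n', 'happy\n', 'fun\n']
tweets = ['this has been a fun day :)\n',
          'i find python cool! it makes me happy\n']

for encoded in encode_lines(tweets, load_tokens(lookup)):
    print(encoded)
```

Unlike the awk version, this sketch does not leave a trailing space at the end of each output line.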
