I have a text file with one tweet per line that needs to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) to achieve this.
from string import punctuation as pnc

tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']

for tweet in tweets:
    # A word counts if it matches as-is (e.g. ':)') or after stripping
    # leading/trailing punctuation (e.g. 'cool!' -> 'cool').
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))
Output:
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
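The `or` in the list comprehension is there to handle `:)`, as suggested by @EOL. Stripping punctuation from `:)` leaves an empty string, so the unmodified word has to be checked first.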
There are still cases that will not be handled correctly, such as `cool :), I like it`: the word `:),` matches neither as-is nor after stripping punctuation. The problem is inherent to the requirements.
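One possible way to cover the `:),` case is to peel trailing punctuation off one character at a time before falling back to the full strip. This is only a sketch: `is_token` and `encode` are hypothetical helper names, not part of the answer above, and other edge cases (for example punctuation in front of an emoticon) are still not handled.

```python
from string import punctuation as pnc

tokens = {':)', 'cool', 'happy', 'fun'}

def is_token(word, tokens):
    """Hypothetical helper: try the word as-is, then with trailing
    punctuation removed one character at a time."""
    candidate = word
    while candidate:
        if candidate in tokens:
            return True
        if candidate[-1] not in pnc:
            break
        candidate = candidate[:-1]
    # Fall back to the original rule: strip punctuation from both ends.
    return word.strip(pnc) in tokens

def encode(tweet, tokens):
    """Return the 0/1 pattern for one tweet as a space-separated string."""
    return ' '.join('1' if is_token(w, tokens) else '0' for w in tweet.split())

print(encode('cool :), I like it', tokens))  # 1 1 0 0 0
```

With the original rule the same tweet gives `1 0 0 0 0`, because `':),'.strip(pnc)` is the empty string.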
If you need an all-regex solution, have a look at my answer to "Changing lines of text into binary type pattern".
In awk:
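awk '
# First file (lookup): store each token as an array key.
NR==FNR {
    a[$1];
    next
}
# Second file (tweets): print 1 for each word found in the lookup, else 0.
{
    gsub(/!/, "", $0)    # This will ignore `!`. Other rules can be added.
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets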
(The `gsub` line is there to handle special cases.)
[jaypal:~/Temp] cat lookup
:)
cool
happy
fun
[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy
[jaypal:~/Temp] awk '
NR==FNR {
    a[$1];
    next
}
{
    gsub(/!/, "", $0)
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
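
For comparison, the same two-file workflow can be sketched in Python. This is an illustration rather than a drop-in replacement: `load_tokens` and `encode_file` are hypothetical names, each lookup line is treated as one token (the awk script uses only the first field), and punctuation is stripped with `str.strip` instead of the `gsub` rule. The demo writes the sample `lookup` and `tweets` data to a temporary directory so it runs on its own.

```python
import os
import tempfile
from string import punctuation as pnc

def load_tokens(path):
    """Read one token per line, skipping blank lines."""
    with open(path, encoding='utf-8') as fh:
        return {line.strip() for line in fh if line.strip()}

def encode_file(lookup_path, tweets_path):
    """Return one 0/1 pattern string per line of the tweets file."""
    tokens = load_tokens(lookup_path)
    rows = []
    with open(tweets_path, encoding='utf-8') as fh:
        for line in fh:
            flags = [(w in tokens or w.strip(pnc) in tokens) for w in line.split()]
            rows.append(' '.join('1' if f else '0' for f in flags))
    return rows

# Demo using the sample data from the transcript above.
with tempfile.TemporaryDirectory() as tmp:
    lookup = os.path.join(tmp, 'lookup')
    tweets = os.path.join(tmp, 'tweets')
    with open(lookup, 'w', encoding='utf-8') as fh:
        fh.write(':)\ncool\nhappy\nfun\n')
    with open(tweets, 'w', encoding='utf-8') as fh:
        fh.write('this has been a fun day :)\n'
                 'i find python cool! it makes me happy\n')
    rows = encode_file(lookup, tweets)

print('\n'.join(rows))
```

Unlike the awk version, this does not leave a trailing space at the end of each output line.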