How can I remove stop words from a sentence using a shell script?

Submitted by 浪子不回头ぞ on 2021-01-07 06:57:50

Question


I'm trying to remove stop words from the sentences in a file.

The stop words I mean are:
[I, a, an, as, at, the, by, in, for, of, on, that]

I have this sentence in the file my_text.txt:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

I want to remove the stop words from the sentence above.

I used this script:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
cat $p  | sed -e 's/\<$i\>//g' 
done < my_text.txt

But the output is:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

The expected output is:

One primary goals design Unix system was to create environment promoted efficient program

Can somebody help?


Answer 1:


Like this, assuming $p is an existing file:

 sed -i -e "s/\<$i\>//g" "$p"

You have to use double quotes, not single quotes, so that the shell expands the variable before sed sees the expression.

The -i switch edits the file in place.

Learn how to quote properly in shell; it's very important:

"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
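To make the quoting rule above concrete, here is a minimal sketch that runs the same substitution with single and with double quotes. The variable name word and the sample text are made up for the demo, and \< and \> assume GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical demo: `word` and the sample text are not from the question.
word=the
text='One of the primary goals'

# Single quotes: sed receives the literal characters $word, so nothing matches.
single=$(printf '%s\n' "$text" | sed -e 's/\<$word\>//g')

# Double quotes: the shell expands $word to "the" before sed runs.
double=$(printf '%s\n' "$text" | sed -e "s/\<$word\>//g")

printf 'single: [%s]\n' "$single"
printf 'double: [%s]\n' "$double"
```

The single-quoted run leaves the text unchanged, which is the behaviour reported in the question. The double-quoted run removes "the" but leaves the space that followed it.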

Finally:

# \<, \> and \s are GNU sed extensions
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
    sed -i -e "s/\<$i\>\s*//g" my_text.txt
done

Bonus

Try it without \s* to see why I added it to the regex.
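The loop above rewrites the file once per stop word. As a possible alternative, and not something the answer itself proposes, all the words can be joined into one alternation so sed makes a single pass. This is a sketch that assumes GNU sed; remove_stop_words and stop_words are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU sed (\<, \> and \s are GNU extensions).
# remove_stop_words and stop_words are hypothetical names for this example.
stop_words=( I a an as at the by in for of on that )

remove_stop_words() {
    local IFS='|'                                 # "${stop_words[*]}" joins with the first IFS character
    local pattern="\<(${stop_words[*]})\>\s*"     # e.g. \<(I|a|an|...)\>\s*
    sed -E -e "s/$pattern//g"                     # reads stdin, writes stdout; the file is not modified
}

printf '%s\n' 'One of the primary goals in the design of the Unix system' |
    remove_stop_words
```

This prints "One primary goals design Unix system". To process the file from the question, redirect it in: remove_stop_words < my_text.txt. Because the function does not use -i, the original file stays untouched.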




Answer 2:


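Here's one in awk. It's a working prototype, but it needs proper punctuation handling and then some (luckily, your data had none). The file words holds the stop words on one line, comma-separated without spaces, and text holds the sentences: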

$ awk '
NR==FNR {                         # first file: process stop words
    split($0,a,/,/)               # comma separated without space
    for(i in a)                   # they go to b hash
        b[a[i]]
    next
}
{                                 # second file: reading the text
    for(i=1;i<=NF;i++)            # iterating over the words
        if(!($i in b))            # if current word not found in stop words
            printf "%s%s",$i,OFS  # output it (leftover space in the end, sorry)
    print ""                      # newline at the end of the line
}' words text

Output:

One primary goals design Unix system was to create environment promoted efficient program 

Why awk? Shell is a tool for managing files and launching programs. Everything beyond that is better handled elsewhere.
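The two rough edges mentioned above (punctuation and the trailing space) could be handled along these lines. This is only a sketch under my own assumptions: the stop words are passed as a comma-separated string instead of a file, only a small fixed set of punctuation characters is considered, and strip_stop_words is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: strip_stop_words is a hypothetical name, and the stop-word
# list is passed with -v instead of being read from a separate file.
strip_stop_words() {
    awk -v list='I,a,an,as,at,the,by,in,for,of,on,that' '
    BEGIN {
        n = split(list, a, ",")
        for (i = 1; i <= n; i++) stop[a[i]]       # keys only; the values are unused
    }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            key = $i
            gsub(/^[.,;:!?]+|[.,;:!?]+$/, "", key)    # compare without leading/trailing punctuation
            if (!(key in stop))
                out = (out == "" ? $i : out OFS $i)   # join kept words without a trailing space
        }
        print out
    }'
}

printf '%s\n' 'In the end, that was the goal of Unix.' | strip_stop_words
```

A kept word is printed with its punctuation intact, so "end," stays "end,". A stop word is dropped together with any punctuation attached to it, which may or may not be what you want. Matching is case-sensitive, as in the original, so "In" is kept.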




Answer 3:


I am also very fond of using awk for text processing. Assuming the input data is in the file mytext.txt, and script is the file containing the code below, simply run it as awk -f script mytext.txt.

This should also make it easier to change the stop words when needed: just edit the stopwords variable. Keep in mind that both mytext.txt and stopwords must contain only space-separated words.

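BEGIN {
    stopwords = "I a an as at the by in for of on that"
    split(stopwords, wordarray)   # split on whitespace into wordarray
    ORS = " "                     # print the kept words separated by a space
    RS = " "                      # read one space-separated word per record
}

{
    equals = 0
    for (w in wordarray)
        if ($0 == wordarray[w]) {
            equals = 1            # the current word is a stop word
            break
        }
    if (equals == 0) print $0
}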


Source: https://stackoverflow.com/questions/65331755/how-can-i-remove-the-stop-words-from-sentence-using-shell-script
