Question
I'm trying to remove stop words from the sentences in a file.
The stop words I mean are: [I, a, an, as, at, the, by, in, for, of, on, that]
I have this sentence in the file my_text.txt:
One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program
Then I want to remove the stop words from the sentence above.
I used this script:
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
cat $p | sed -e 's/\<$i\>//g'
done < my_text.txt
But the output is:
One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program
The expected output is:
One primary goals design Unix system was to create an environment promoted efficient program
Can somebody help?
Answer 1:
Like this, assuming $p is an existing file:
sed -i -e "s/\<$i\>//g" "$p"
You have to use double quotes, not single quotes, for the variable to be expanded.
The -i switch edits the file in place.
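A small sketch of the difference between the two quoting styles. It assumes GNU sed (for the \< and \> word boundaries); the variable names line, single and double are only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: GNU sed, which supports the \< and \> word-boundary anchors.
i="the"
line="the design of the Unix system"

# Single quotes: sed receives the literal characters $i, so nothing matches.
single=$(printf '%s\n' "$line" | sed -e 's/\<$i\>//g')

# Double quotes: the shell expands $i to "the" before sed sees the script.
double=$(printf '%s\n' "$line" | sed -e "s/\<$i\>//g")

printf 'single: [%s]\n' "$single"   # single: [the design of the Unix system]
printf 'double: [%s]\n' "$double"   # double: [ design of  Unix system]
```

Note the leftover spaces in the second result; they are what is left behind where each "the" used to be.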
Learn how to quote properly in shell, it's very important:
"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
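To make the quoting rule concrete, here is a minimal bash sketch. The helper count_args and the sample array are made up for the demonstration; it assumes the default IFS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# One element deliberately contains a space.
words=( "stop word" the of )

# Quoted expansion: every array element stays one argument.
quoted=$(count_args "${words[@]}")

# Unquoted expansion: the shell splits "stop word" into two arguments.
unquoted=$(count_args ${words[@]})

echo "quoted=$quoted unquoted=$unquoted"   # quoted=3 unquoted=4
```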
Finally
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
sed -i -e "s/\<$i\>\s*//g" Input_File
done
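The loop above rewrites the file once per stop word. As a possible alternative, here is a sketch that joins the words into one alternation and filters standard input in a single pass. It assumes bash and GNU sed (-E, \<, \> and \s); the function name remove_stop_words is made up, and it prints to standard output instead of editing in place.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin, writes it to stdout without stop words.
# Assumption: GNU sed, for -E together with \<, \> and \s.
remove_stop_words() {
    local stop_words=( I a an as at the by in for of on that )
    local IFS='|'   # "${stop_words[*]}" joins the elements with the first char of IFS
    sed -E "s/\<(${stop_words[*]})\>\s*//g"
}

printf '%s\n' 'One of the primary goals in the design of the Unix system' | remove_stop_words
# One primary goals design Unix system
```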
Bonus
Try it without \s* to understand why I added it to the regex.
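For anyone who wants to compare directly, this sketch runs both variants on a fragment of the sentence. It assumes GNU sed; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: GNU sed, which supports \<, \> and \s.
line="goals in the design of the Unix system"

# Without \s*: the word goes away but the space after it stays.
without=$(printf '%s\n' "$line" | sed -e 's/\<the\>//g')

# With \s*: the trailing whitespace is removed together with the word.
with=$(printf '%s\n' "$line" | sed -e 's/\<the\>\s*//g')

printf '[%s]\n' "$without"   # [goals in  design of  Unix system]
printf '[%s]\n' "$with"      # [goals in design of Unix system]
```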
Answer 2:
Here is one in awk. It's a working prototype, but it still needs proper punctuation handling and more (then again, luckily your data had none):
$ awk '
NR==FNR {                        # first file: process stop words
    split($0,a,/,/)              # comma separated without space
    for(i in a)                  # they go to b hash
        b[a[i]]
    next
}
{                                # second file: reading the text
    for(i=1;i<=NF;i++)           # iterating the words
        if(!($i in b))           # if current word not found in stop words
            printf "%s%s",$i,OFS # output it (leftover space in the end, sorry)
    print ""                     # newline at the end of the line
}' words text
Output:
One primary goals design Unix system was to create environment promoted efficient program
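For reference, a self-contained sketch of the same idea that also creates the two input files, so the expected format of words (one comma-separated line, no spaces) is visible. The scratch directory and the variable names are assumptions for the demo, and this variant builds each line in a string so that no trailing space is left.

```shell
#!/usr/bin/env bash
# Hypothetical setup: create both input files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'I,a,an,as,at,the,by,in,for,of,on,that' > "$workdir/words"
printf '%s\n' 'One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program' > "$workdir/text"

result=$(awk '
NR==FNR {                          # first file: load the stop words
    n = split($0, a, /,/)
    for (i = 1; i <= n; i++)
        b[a[i]]                    # referencing the element creates the key
    next
}
{                                  # second file: filter every line
    out = ""
    for (i = 1; i <= NF; i++)
        if (!($i in b)) {
            sep = (out == "" ? "" : OFS)
            out = out sep $i       # join kept words, no trailing separator
        }
    print out
}' "$workdir/words" "$workdir/text")

printf '%s\n' "$result"
rm -r "$workdir"
```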
Why awk? The shell is a tool for managing files and launching programs. Everything else is better handled elsewhere.
Answer 3:
I am also very fond of using awk for text processing. Assuming the input data is in the file mytext.txt, and script is the file containing the code below, simply run it as awk -f script mytext.txt.
Also, this should make it easier to change the stop words when needed, by changing the stopwords variable. Keep in mind that both mytext.txt and stopwords must contain only space-separated words.
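BEGIN {
    stopwords = "I a an as at the by in for of on that"
    split(stopwords, wordarray)   # one array element per stop word
    ORS = " "                     # print the kept words separated by spaces
    RS = " "                      # read the input one word per record
}
{
    equals = 0
    for (w in wordarray)          # compare the current word with every stop word
        if ($0 == wordarray[w])
            equals = 1
    if (equals == 0) print $0     # keep only words that are not stop words
}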
Source: https://stackoverflow.com/questions/65331755/how-can-i-remove-the-stop-words-from-sentence-using-shell-script