I'm running some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, together with pipes, to split a text file into words and save them into another file with one word per line. For example, my file contains:
Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.
The output file should contain:
Hola
mundo
hablo
español
...
Thanks!
Using tr:
tr -s '[[:punct:][:space:]]' '\n' < file
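On the sample input above I would expect output along the lines of:
$ tr -s '[[:punct:][:space:]]' '\n' < file | head -4
Hola
mundo
hablo
español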
The simplest tool is fmt:
fmt -1 <your-file
fmt is designed to break lines to fit the specified width, and if you give it -1 it breaks immediately after each word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
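Note that fmt only breaks at whitespace, so punctuation stays attached to the words; on the sample input I would expect something like:
$ fmt -1 < file | head -3
Hola
mundo,
hablo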
Using sed:
$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile
Basically this deletes all punctuation and replaces any run of whitespace with a newline. It also assumes your flavor of sed understands \n. Some do not, in which case you can use a literal newline instead (i.e. by embedding it inside the quotes).
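For such seds, something like this should work (a sketch; [[:space:]][[:space:]]* instead of \+, since \+ is also a GNU extension):
sed -e 's/[[:punct:]]*//g;s/[[:space:]][[:space:]]*/\
/g' < inputfile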
grep -o prints only the part of each matching line that matches the pattern:
grep -o '[[:alpha:]]*' file
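In a UTF-8 locale the accented characters should survive, so on the sample input I would expect something like:
$ grep -o '[[:alpha:]]*' file | head -4
Hola
mundo
hablo
español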
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
tr -d ",." deletes "," and "."
tr " \t" "\n" changes spaces and tabs to newlines
grep -e "^$" -v deletes empty lines (in case of two or more spaces)
This awk line may work too:
awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile
Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alphanumeric characters (e.g. "<" and ";", but not "'", "-", "#", "$" or "%"). Now, "." is a sentence-ending character, but you said that $27.00 should be considered a "word", so "." needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo@bar.com
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo@bar.com
Try to come up with some other test cases to see if this always does what you want.
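Spread out with comments, the same gawk program looks roughly like this (a sketch, identical in behavior to the one-liner above):
gawk -v RS='[[:space:]?!]+' '
  {
    # trim leading characters that are not alphanumeric, "$" or "#",
    # and trailing characters that are not alphanumeric or "%"
    gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/, "")
  }
  $0 != ""   # print the record only if something is left of it
' file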
A very simple first option would be:
sed 's,\(\w*\),\1\n,g' file
Beware that it handles neither apostrophes nor punctuation.
Using perl:
perl -ne 'print join("\n", split)' < file
Using perl:
perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file
Output
Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojal�
me
puedan
entender
y
ayudar
Adiós
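The mangled "ojal�" above appears to happen because perl matches \p{Punct} against raw bytes, and the second byte of the UTF-8 "á" looks like punctuation to it. If the input is UTF-8, telling perl so should fix that, e.g.:
perl -CSD -pe 's/(?:\p{Punct}|\s+)+/\n/g' file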
perl -ne 'print join("\n", split)'
Sorry @jsageryd, that one-liner does not give the correct answer, as it joins the last word of each line with the first word of the next.
This is better, but it generates a blank line for each blank line in the source; pipe via | sed '/^$/d' to fix that:
perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'
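Combined, and assuming the input is in file, that would be something like:
perl -ne '{ print join("\n", split(/[[:^word:]]+/)), "\n"; }' file | sed '/^$/d'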
Source: https://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line