I'm running some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, together with pipes, to split a text file into words and save them into another file with one word per line. For example, my file contains:
Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.
The output file should contain:
Hola
mundo
hablo
español
...
Thanks!
Using tr:
tr -s '[[:punct:][:space:]]' '\n' < file
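On the sample input above I would expect output along the lines of:
$ tr -s '[[:punct:][:space:]]' '\n' < file | head -4
Hola
mundo
hablo
español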
The simplest tool is fmt:
fmt -1 <your-file
fmt is designed to break lines to fit the specified width, and if you give it -1 it breaks immediately after each word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
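Note that fmt only breaks at whitespace, so punctuation stays attached to the words; on the sample input I would expect something like:
$ fmt -1 < file | head -3
Hola
mundo,
hablo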
Using sed:
$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile
Basically this deletes all punctuation and replaces any run of whitespace with a newline. It also assumes your flavor of sed understands \n. Some do not, in which case you can use a literal newline instead (i.e. by embedding it inside the quotes).
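For such seds, something like this should work (a sketch; [[:space:]][[:space:]]* instead of \+, since \+ is also a GNU extension):
sed -e 's/[[:punct:]]*//g;s/[[:space:]][[:space:]]*/\
/g' < inputfile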
grep -o prints only the part of each matching line that matches the pattern:
grep -o '[[:alpha:]]*' file
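In a UTF-8 locale the accented characters should survive, so on the sample input I would expect something like:
$ grep -o '[[:alpha:]]*' file | head -4
Hola
mundo
hablo
español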
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
tr -d ",." deletes "," and "."
tr " \t" "\n" changes spaces and tabs to newlines
grep -e "^$" -v deletes empty lines (in case of two or more spaces)
This awk line may work too:
awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile
Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alphanumeric characters (e.g. "<" and ";", but not "'", "-", "#", "$" or "%"). Now, "." is a sentence-ending character, but you said that $27.00 should be considered a "word", so "." needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo@bar.com
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo@bar.com".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo@bar.com
Try to come up with some other test cases to see if this always does what you want.
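Spread out with comments, the same gawk program looks roughly like this (a sketch, identical in behavior to the one-liner above):
gawk -v RS='[[:space:]?!]+' '
  {
    # trim leading characters that are not alphanumeric, "$" or "#",
    # and trailing characters that are not alphanumeric or "%"
    gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/, "")
  }
  $0 != ""   # print the record only if something is left of it
' file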
A very simple first option would be:
sed 's,\(\w*\),\1\n,g' file
Beware that it handles neither apostrophes nor punctuation.
Using perl:
perl -ne 'print join("\n", split)' < file
Using perl:
perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file
Output
Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojal�
me
puedan
entender
y
ayudar
Adiós
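The mangled "ojal�" above appears to happen because perl matches \p{Punct} against raw bytes, and the second byte of the UTF-8 "á" looks like punctuation to it. If the input is UTF-8, telling perl so should fix that, e.g.:
perl -CSD -pe 's/(?:\p{Punct}|\s+)+/\n/g' file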
perl -ne 'print join("\n", split)'
Sorry @jsageryd, that one-liner does not give the correct answer, as it joins the last word of each line with the first word of the next.
This is better, but it generates a blank line for each blank line in the source; pipe via | sed '/^$/d' to fix that:
perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'
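Combined, and assuming the input is in file, that would be something like:
perl -ne '{ print join("\n", split(/[[:^word:]]+/)), "\n"; }' file | sed '/^$/d'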
Source: https://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line