I have a sentence like
This is for example
I want to write this to a file such that each word in this sentence is written to a s
Do you care about punctuation marks? For example, with some invocations a 'word' like (etc) would come out exactly like that, parentheses included, or you would get 'parentheses.' rather than 'parentheses'. If you're parsing a file with proper sentences that could be a problem, especially if you want to sort by word or get a count for each word.
There are ways to deal with this, but there are some caveats and certainly room for improvement. The caveats have to do with numbers, dashes (in numbers) and decimal points/dots (in numbers). Perhaps having an exact set of rules would help resolve this, but the examples below should give you something to work from. I have made some contrived input examples to demonstrate these flaws (or whatever you wish to call them).
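To make the punctuation problem concrete, here is a minimal sketch of the naive approach, splitting on whitespace only. The function name split_on_whitespace is made up for this illustration; it is not part of any tool.

```shell
# Hypothetical helper: split on whitespace only, one token per line.
# Punctuation attached to a word stays attached to it.
split_on_whitespace() {
    tr -s '[:space:]' '\n'
}

echo "Some words (etc) sit in parentheses." | split_on_whitespace
```

The tokens (etc) and parentheses. come out with their punctuation still attached, which is what would throw off sorting or counting.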
$ echo "This is an example sentence with punctuation marks and digits i.e. , . ; \! 7 8 9" | grep -o -E '\<[A-Za-z0-9.]*\>'
This
is
an
example
sentence
with
punctuation
marks
and
digits
i.e
7
8
9
As you can see, i.e. turns out to be just i.e and the other punctuation marks are not shown. This still leaves out things like version numbers of the form major.minor.revision-release, e.g. 0.0.1-1; can these be shown too? Yes:
$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[-A-Za-z0-9.]*\>'
The
current
version
is
0.0.1-1
The
previous
version
was
current
from
2017-2018
Observe that the full stop at the end of each sentence is not part of the output: 0.0.1-1 and 2017-2018 are printed without it. What happens if you add a space between the years and the dash? You won't get the dash, and each year will be on its own line:
$ echo "2017 - 2018" | grep -o -E '\<[-A-Za-z0-9.]*\>'
2017
2018
The question then becomes whether you want a - by itself to be counted; by the very nature of separating words you won't have the years as a single string if there are spaces. Because it's not a word by itself I would think not.
I am sure these could be simplified further. In addition, if you don't want any punctuation or numbers at all you could change it to:
$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is
The
previous
version
was
current
from
If you wanted to have the numbers:
$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
The
previous
version
was
current
from
2017
2018
As for 'words' with both letters and numbers, that's another thing that may or may not matter to you. The pattern that allows digits prints them (see test1 in the output):
$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
test1
The following does not print test1 or any of the digits, because the pattern doesn't allow numbers at all:
$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is
It's quite easy to disregard punctuation marks, but in some cases there might be a need or desire for them. In the case of e.g. I suppose you could use, say, sed to change lines like e.g back to e.g., but that would be a personal preference.
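As a rough sketch of that sed idea, assuming you only care about a fixed list of abbreviations (here just e.g and i.e), something like the following could put the final dot back. The function name restore_abbreviation_dots is invented for the example.

```shell
# Hypothetical helper: put the final dot back on a fixed list of abbreviations.
# Only lines that consist entirely of the abbreviation are changed.
restore_abbreviation_dots() {
    sed -E 's/^(e\.g|i\.e)$/\1./'
}

echo "This sentence contains i.e. and e.g. as examples." |
    grep -o -E '\<[-A-Za-z0-9.]*\>' |
    restore_abbreviation_dots
```

Any abbreviation not in the list is left without its dot, so you would have to extend the alternation yourself.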
I can summarise how it works, but only briefly; I’m far too tired to go into much detail. I will only explain the invocation grep -o -E '\<[-A-Za-z0-9.]*\>' but much of it is the same in the others.
The -o option is for printing only the matches rather than the entire line. The -E is for extended regular expressions (you could just as well use egrep, although recent GNU grep releases call egrep obsolescent); among other things, the vertical bar/pipe symbol in extended grep allows for more than one pattern.
As for the regexp itself: the \< and \> are word boundaries (beginning and end respectively; you can specify only one if you want). I believe the -w option is similar to specifying both, but I have not checked whether it behaves identically.
The bracket expression [-A-Za-z0-9.]* says dashes, upper and lower case letters, digits and dots, zero or more times. As for why it turns e.g. into e.g: the closing \> only matches directly after a word character (a letter, digit or underscore), so a match can never end on a dot and the trailing dot is dropped.
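To see what the closing \> does to a trailing dot, here is a small sketch that runs the same input through the pattern with and without it. It assumes GNU grep (for \< and \>), uses + instead of * (with -o the printed output is the same), and both function names are made up for the comparison.

```shell
# With both boundaries: a match has to end right after a word character,
# so a trailing dot is never included.
words_with_boundaries() {
    grep -o -E '\<[-A-Za-z0-9.]+\>'
}

# Without the closing boundary: the match runs on through any trailing dots.
words_without_closing_boundary() {
    grep -o -E '\<[-A-Za-z0-9.]+'
}

echo "Use tabs e.g. here." | words_with_boundaries
echo "Use tabs e.g. here." | words_without_closing_boundary
```

The first call prints e.g and here; the second prints e.g. and here. so dropping \> keeps the abbreviation intact at the cost of also keeping the sentence-final dot.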
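#!/bin/bash
# Print a word-frequency table for each file given on the command line.
if [ $# -eq 0 ]; then
    echo "Usage: $(basename "${0}") [FILE...]"
    exit 1
fi
for file do
    if [ -e "${file}" ]
    then
        echo "** ${file}: "
        # One word per line, then count duplicates and sort by frequency.
        grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|sort | uniq -c | sort -rn
    else
        # Report the file that is actually missing, not always the first argument.
        echo >&2 "${file}: file not found"
    fi
done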
Example:
$ cat example
The current version is 0.0.1-1 but the previous version was non-existent.
This sentence contains an abbreviation i.e. e.g. (so actually two abbreviations).
This sentence has no numbers and no punctuation
$ ./wordfreq example
** example:
2 version
2 sentence
2 no
2 This
1 was
1 two
1 the
1 so
1 punctuation
1 previous
1 numbers
1 non-existent
1 is
1 i.e
1 has
1 e.g
1 current
1 contains
1 but
1 and
1 an
1 actually
1 abbreviations
1 abbreviation
1 The
1 0.0.1-1
N.B. I didn't transliterate upper case to lower case so the words 'The' and 'the' show up as different words. If you wanted them to be all lower case you could change the grep invocation in the script to be piped to tr before sorting:
grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|tr '[:upper:]' '[:lower:]'|sort | uniq -c | sort -rn
Oh, and since you asked: if you want to write the result to a file, you can just append this to the command line (this is for the raw invocation):
> output_file
For the script you would use it like:
$ ./wordfreq file1 file2 file3 > output_file