Question
Seems like this is rather simple, but I'm having trouble.
I have a text document that looks, for example, like this:
This is a
TEXT DOCUMENT with
SOME capitalized words
BUT NOT all of them are
ALL CAPS
iPhone
What I would like is to parse this document and match only whole words made up of only uppercase letters, like so:
TEXT DOCUMENT
SOME
BUT NOT
ALL CAPS
I wrote this:
grep -o "\w[[:upper:]]\w" Untitled.txt
This gets pretty close but, alas, returns this:
TEX
DOC
UME
SOM
BUT
NOT
ALL
CAP
iPh
...which, candidly, I don't understand.
So: what might I be missing? egrep doesn't work very well under OS X because I'm limited by FreeBSD's grep (grep (BSD grep) 2.5.1-FreeBSD), I guess, so many of the solutions I've found for egrep that seem like they would work don't work as expected.
Answer 1:
You are missing *, and \w matches any word character, not just uppercase letters. The correct regexp is:
\<[[:upper:]][[:upper:]]*\>
\< and \> match the start and end of a word, respectively.
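As a quick check of this expression, here is a minimal sketch that pipes the question's sample text into grep instead of reading Untitled.txt. The variable names sample and matches are made up for illustration.

```shell
# Sample text from the question, held in a variable instead of Untitled.txt.
sample='This is a
TEXT DOCUMENT with
SOME capitalized words
BUT NOT all of them are
ALL CAPS
iPhone'

# One uppercase letter, then zero or more uppercase letters,
# with \< and \> pinning the match to the edges of a word.
matches=$(printf '%s\n' "$sample" | grep -o '\<[[:upper:]][[:upper:]]*\>')
printf '%s\n' "$matches"
```

With -o, each matching word is printed on its own line, so TEXT and DOCUMENT come out as two separate lines rather than as "TEXT DOCUMENT".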
Answer 2:
To complement Zbynek Vyskovsky - kvr000's helpful answer:
grep's -E option enables extended regular expressions, which include the quantifier + (one or more) and so simplify the solution:
grep -Eo '\<[[:upper:]]+\>' Untitled.txt
Also, as mentioned in Benjamin W.'s answer, -w can be used to match on word boundaries without having to specify them as part of the regex:
grep -Ewo '[[:upper:]]+' Untitled.txt
Note, however, that -w is a nonstandard option (but both BSD/OSX and GNU grep implement it).
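To see that the two spellings agree, this small sketch runs both commands on one line of the question's sample text and compares the results. The variable names are illustrative only.

```shell
line='BUT NOT all of them are'

# Word boundaries written into the regex itself.
with_anchors=$(printf '%s\n' "$line" | grep -Eo '\<[[:upper:]]+\>')

# Word boundaries requested through the -w option instead.
with_w=$(printf '%s\n' "$line" | grep -Ewo '[[:upper:]]+')

if [ "$with_anchors" = "$with_w" ]; then
    echo "same matches"
else
    echo "different matches"
fi
```

For this input both commands yield BUT and NOT, each on its own line.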
As for egrep: it is nothing more than an (effective) alias of grep -E, which, as stated, activates support for extended regular expressions, but the exact set of features is platform-dependent.
Additionally, only GNU grep supports the -P option, which enables PCREs (Perl-Compatible Regular Expressions) and offers even more features and flexibility.
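Because -P is not available everywhere, a script that wants it can probe for it first. The following is only a sketch under that assumption: the probe, the PCRE pattern \b[[:upper:]]+\b, and the variable names are illustrative and not taken from the answers here.

```shell
sample='TEXT DOCUMENT with
iPhone'

# Probe for -P support; a grep without it exits with an error here.
if printf 'A\n' | grep -qP 'A' 2>/dev/null; then
    # PCRE: \b is a word boundary on either side of the uppercase run.
    matches=$(printf '%s\n' "$sample" | grep -oP '\b[[:upper:]]+\b')
else
    # Fallback that relies only on -E, -o and -w.
    matches=$(printf '%s\n' "$sample" | grep -Eow '[[:upper:]]+')
fi
printf '%s\n' "$matches"
```

Either branch should print TEXT and DOCUMENT for this input, so the rest of a script does not need to know which one ran.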
Answer 3:
The example output shows multiple space-separated uppercase words on the same line, which can be achieved with:
$ grep -ow '[[:upper:]][[:upper:][:space:]]*[[:upper:]]' infile
TEXT DOCUMENT
SOME
BUT NOT
ALL CAPS
This matches any sequence that starts and ends with an uppercase character and has only uppercase characters or whitespace in between. -o returns the matches only, and -w makes sure that we don't match something like WORDlowercase.
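The following sketch exercises that command on a few extra lines to show where its limits are. The lines WORDlowercase and "A single letter" are made-up test inputs, not part of the question's file.

```shell
sample='TEXT DOCUMENT with
BUT NOT all of them are
WORDlowercase
A single letter'

# The pattern needs an uppercase letter at both ends,
# so every match is at least two characters long.
matches=$(printf '%s\n' "$sample" |
    grep -ow '[[:upper:]][[:upper:][:space:]]*[[:upper:]]')
printf '%s\n' "$matches"
```

Only TEXT DOCUMENT and BUT NOT are printed. WORDlowercase is rejected by -w, and the lone A is skipped because the pattern requires at least two uppercase letters, which is worth keeping in mind if single-letter words matter.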
Answer 4:
You can use this command:
grep -o -E "\<[[:upper:]]+\>" Untitled.txt
- -E activates extended regexps, which makes + available; + stands for one or more repetitions
- \< and \> are anchors marking the beginning and end of a word
- the whole regex means a sequence of one or more uppercase characters that makes up a whole word
Your original regexp gave you three-letter matches because \w stands for [_[:alnum:]], so you instructed grep to match something that consists of three characters:
- the first and third from [_[:alnum:]]
- the second from the [[:upper:]] range
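The three-character behaviour can be reproduced on two lines of the question's sample text. This is a minimal sketch; the variable names are illustrative.

```shell
sample='TEXT DOCUMENT with
iPhone'

# \w, then one uppercase letter, then \w: always exactly three characters.
matches=$(printf '%s\n' "$sample" | grep -o '\w[[:upper:]]\w')
printf '%s\n' "$matches"
```

This prints TEX, DOC, UME and iPh. Since -o reports non-overlapping matches from left to right, DOCUMENT is consumed as DOC and UME, and the leftover NT is too short to match. iPh qualifies because only the middle character has to be uppercase.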
Answer 5:
An "old school" RE would be fewer characters:
grep -o '[A-Z][A-Z]*' Untitled.txt
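It uses the -o option to print only the matching text and matches against uppercase A through Z. Without -w, though, it also matches uppercase runs inside mixed-case words, such as the T in This.
Adding -w to match whole words and -E to enable extended regular expressions allows this one, which is even shorter: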
grep -woE '[A-Z]+\>' Untitled.txt
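A side-by-side run of the two commands on part of the sample text shows what -w changes. In this sketch, LC_ALL=C is an added assumption: it pins the meaning of the A-Z range, which can otherwise depend on the locale, and the variable names are illustrative.

```shell
sample='This is a
TEXT DOCUMENT with
iPhone'

# Without -w: any run of uppercase letters, even inside a longer word.
loose=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[A-Z][A-Z]*')

# With -w: only runs that form a complete word.
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -woE '[A-Z]+\>')

printf 'without -w:\n%s\n' "$loose"
printf 'with -w:\n%s\n' "$strict"
```

The first command reports T, TEXT, DOCUMENT and P, while the second reports only TEXT and DOCUMENT.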
Source: https://stackoverflow.com/questions/35107131/grep-whole-words-made-of-only-uppercase-letters