grep whole words made of only uppercase letters

Submitted by て烟熏妆下的殇ゞ on 2019-12-02 05:48:34

Question


Seems like this is rather simple, but I'm having trouble.

I have a text document that looks, for example, like this:

This is a
TEXT DOCUMENT with
SOME capitalized words
BUT NOT all of them are
ALL CAPS
iPhone

What I would like is to parse this document and match only whole words made up of only uppercase letters, like so:

TEXT DOCUMENT
SOME
BUT NOT
ALL CAPS

I wrote this:

grep -o "\w[[:upper:]]\w" Untitled.txt

This gets pretty close but, alas, returns this:

TEX
DOC
UME
SOM
BUT
NOT
ALL
CAP
iPh

...which, candidly, I don't understand.

So: what might I be missing? egrep doesn't work very well under OS X because I'm limited by FreeBSD's grep (grep (BSD grep) 2.5.1-FreeBSD), I guess, so many of the solutions I've found for egrep that seem like they would work don't work as expected.


Answer 1:


You are missing a quantifier (*), and \w matches any word character, not just uppercase letters. The correct regexp is:

\<[[:upper:]][[:upper:]]*\>

\< and \> match the beginning and end of a word.
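
As a minimal sketch, the pattern can be dropped into a full command. The temporary file and the variable names are illustrative assumptions; the sample text is the one from the question. With -o each match is printed on its own line, so TEXT and DOCUMENT come out as separate matches.

```shell
# Recreate the sample document from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
This is a
TEXT DOCUMENT with
SOME capitalized words
BUT NOT all of them are
ALL CAPS
iPhone
EOF

# Basic regex: one uppercase letter followed by zero or more uppercase
# letters, with a word boundary on each side.
words=$(grep -o '\<[[:upper:]][[:upper:]]*\>' "$sample")
printf '%s\n' "$words"

rm -f "$sample"
```

Mixed-case words such as This and iPhone produce no output, because the word boundaries force the whole word to be uppercase.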




Answer 2:


To complement Zbynek Vyskovsky - kvr000's helpful answer:

grep's -E option enables extended regular expressions, which include the quantifier + (one or more); this simplifies the solution:

 grep -Eo '\<[[:upper:]]+\>' Untitled.txt

Also, as mentioned in Benjamin W.'s answer, -w can be used to match on word boundaries without having to specify it as part of the regex:

 grep -Ewo '[[:upper:]]+' Untitled.txt

Note, however, that -w is a nonstandard option (but both BSD/OSX and GNU grep implement it).
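
A quick way to convince yourself that the two forms are interchangeable is to run both on the same input and compare. This is only a sketch: the three sample lines and the variable names are assumptions, and it presumes a grep that implements -w (as both BSD and GNU grep do).

```shell
# Small sample with all-caps words, lowercase words, and a mixed-case word.
sample=$(mktemp)
printf '%s\n' 'TEXT DOCUMENT with' 'SOME capitalized words' 'iPhone' > "$sample"

# Explicit word-boundary anchors in the regex.
with_anchors=$(grep -Eo '\<[[:upper:]]+\>' "$sample")

# Same idea, but letting -w enforce the word boundaries.
with_w=$(grep -Ewo '[[:upper:]]+' "$sample")

if [ "$with_anchors" = "$with_w" ]; then
  echo "both commands return the same matches:"
fi
printf '%s\n' "$with_w"

rm -f "$sample"
```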


As for egrep: it is nothing more than an (effective) alias of grep -E, which, as stated, activates support for extended regular expressions, but the exact set of features is platform-dependent.

Additionally, only GNU grep supports the -P option for PCREs (Perl-Compatible Regular Expressions), which offer even more features and flexibility.




Answer 3:


The example output shows multiple space-separated uppercase words on the same line, which can be achieved with

$ grep -ow '[[:upper:]][[:upper:][:space:]]*[[:upper:]]' infile
TEXT DOCUMENT
SOME
BUT NOT
ALL CAPS

This matches any sequence that starts and ends with an uppercase character and has only uppercase characters or whitespace in between. -o prints only the matches, and -w makes sure that we don't match something like WORDlowercase.
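
One consequence of this pattern is that it needs at least two uppercase letters, so a one-letter word such as I is skipped. The sketch below illustrates that and, as an assumption rather than part of the original answer, tries an extended-regex variant that also accepts single-letter words. The sample lines and variable names are made up for illustration.

```shell
# Sample with a multi-word run, a single-letter uppercase word, and a pair.
sample=$(mktemp)
printf '%s\n' 'TEXT DOCUMENT with' 'I see' 'ALL CAPS' > "$sample"

# Pattern from the answer: first and last character must both be uppercase,
# so a lone "I" cannot match.
two_or_more=$(grep -ow '[[:upper:]][[:upper:][:space:]]*[[:upper:]]' "$sample")

# Hypothetical variant: one uppercase word, optionally followed by more
# uppercase words separated by whitespace.
one_or_more=$(grep -Ewo '[[:upper:]]+([[:space:]]+[[:upper:]]+)*' "$sample")

echo "pattern from the answer:"
printf '%s\n' "$two_or_more"
echo "variant allowing one-letter words:"
printf '%s\n' "$one_or_more"

rm -f "$sample"
```

Whether a lone I or A should count as an all-caps word depends on the document, so the stricter original pattern may be exactly what is wanted.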




Answer 4:


You can use this command:

grep -o -E "\<[[:upper:]]+\>" Untitled.txt
  • -E activates extended regexps, which makes + available; it stands for one or more repetitions
  • \< and \> are anchors marking the beginning and end of a word
  • the whole regex means a sequence of one or more uppercase characters that make up the whole word

Your original regexp gave you three letter matches, because \w stands for [_[:alnum:]], so you instructed grep to match something which consists of three characters:

  • the first and third from the [_[:alnum:]]
  • the second from the [[:upper:]] range
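
The three-character behaviour is easy to reproduce side by side with the fix. This is a sketch on two lines taken from the question; the temporary file and variable names are assumptions, and \w is assumed to be available in your grep (it is in GNU grep and in the BSD grep used in the question).

```shell
# Two lines from the question's document.
sample=$(mktemp)
printf '%s\n' 'TEXT DOCUMENT with' 'iPhone' > "$sample"

# Original attempt: word char, uppercase letter, word char.
# Every match is exactly three characters long.
broken=$(grep -o '\w[[:upper:]]\w' "$sample")

# Fixed pattern: a whole word made only of uppercase letters.
fixed=$(grep -o -E '\<[[:upper:]]+\>' "$sample")

echo "original pattern:"
printf '%s\n' "$broken"
echo "fixed pattern:"
printf '%s\n' "$fixed"

rm -f "$sample"
```

The first command yields fragments like TEX and iPh because only the middle character is required to be uppercase, while the second yields the complete words TEXT and DOCUMENT.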



Answer 5:


An "old school" RE would be fewer characters:

grep -o '[A-Z][A-Z]*' Untitled.txt

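It uses the -o option to print only the matching text and matches against uppercase A through Z. Because this pattern has no word anchors, it also picks up uppercase letters inside mixed-case words, such as the T in This or the P in iPhone.

Adding -w to match whole words and -E to enable extended regular expressions allows this version, which is even shorter: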

grep -woE '[A-Z]+\>' Untitled.txt
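
The difference between the two commands shows up on mixed-case words. The sketch below runs both on a few lines from the question. Setting LC_ALL=C is an assumption added here so that the range A-Z means plain ASCII uppercase regardless of locale; the file and variable names are illustrative.

```shell
# Lines from the question that contain mixed-case words.
sample=$(mktemp)
printf '%s\n' 'This is a' 'SOME capitalized words' 'iPhone' > "$sample"

# No word anchors: any run of uppercase letters matches, even inside a word.
unanchored=$(LC_ALL=C grep -o '[A-Z][A-Z]*' "$sample")

# With -w, only runs that form a complete word are reported.
whole_words=$(LC_ALL=C grep -woE '[A-Z]+\>' "$sample")

echo "without word matching:"
printf '%s\n' "$unanchored"
echo "with -w:"
printf '%s\n' "$whole_words"

rm -f "$sample"
```

The first command reports T, SOME and P, while the second reports only SOME.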



Source: https://stackoverflow.com/questions/35107131/grep-whole-words-made-of-only-uppercase-letters
