gawk regex to find any record having characters other than those specified by the character class in the regex pattern

Submitted by 隐身守侯 on 2020-06-02 23:34:49

Question


I have a list of email addresses in a text file, and a pattern whose character classes specify which characters are allowed in an email address. From that input file, I want to find only the email addresses that contain characters other than the allowed ones. I am trying to write a gawk command for this but cannot get it to work properly. Here is what I am trying:

gawk -F "," ' $2!~/[[:alnum:]@\.]]/ { print "has invalid chars" }' emails.csv

The problem is that the above gawk command only matches records that contain NONE of the alphanumeric, @ and . (dot) characters. What I am looking for is records that contain the allowed characters together with disallowed ones.

For example, the above command would find

"_-()&(()%"

because it contains only characters that are not in the regex pattern, but would not find

"abc-123@xyz,com"

because it also contains characters from the character classes in the regex pattern.


Answer 1:


How about combining several tests: the field contains an alnum, an @, a dot, and an invalid character:

$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/
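
As a minimal sketch, the combined test can be wrapped into a complete command, assuming the address sits in the second comma-separated field as in the question. The function name find_mixed and the sample records are made up for illustration.

```shell
# Hypothetical helper: reads CSV on stdin and prints field 2 of every record
# that contains an alnum, an @, a dot, and at least one disallowed character.
find_mixed() {
  awk -F, '$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/ { print $2 }'
}

# Made-up sample records; only the second one satisfies all four tests.
printf '%s\n' '1,abc@xyz.com' '2,abc-123@xyz.com' '3,_-()&(()%' | find_mixed
# prints: abc-123@xyz.com
```

Note that the question's second example, abc-123@xyz,com, contains no dot, so the dot test would exclude it; drop that test if such records should be reported too.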



Answer 2:


Your regex is wrong here:

/[[:alnum:]@\.]]/

It should be:

/[[:alnum:]@.]/

Note the removal of the extra ] from the end.

Test Case:

# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]@.]]/)}' <<< 'abc,ab@email.com'
1

# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]@.]/)}' <<< 'abc,ab@email.com'
0
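
To see why the extra ] matters: outside a bracket expression, ] is an ordinary character, so the original pattern only matches an allowed character that is immediately followed by a literal ]. A small sketch of that behaviour (the function name and input strings are made up):

```shell
# Hypothetical helper: prints 1 if the line matches the pattern that still
# has the extra ], and 0 otherwise.
match_extra_bracket() {
  awk '{ print ($0 ~ /[[:alnum:]@.]]/) }'
}

echo 'ab@email.com' | match_extra_bracket   # prints 0: no literal ] in the input
echo 'ab]' | match_extra_bracket            # prints 1: "b]" matches
```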



Answer 3:


Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is

$2 ~ /[^[:alnum:]@.]/{ print "has invalid chars" }

That won't trigger on an empty string, so you might want to add a test for that case.
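
One possible way to add that empty-string test, as a sketch only: it assumes the first field is some record label and the second field is the address. The function name flag_invalid and the sample data are hypothetical.

```shell
# Hypothetical helper: reports records whose field 2 is empty or contains
# any character outside [[:alnum:]@.]
flag_invalid() {
  awk -F, '$2 == ""              { print $1 ": empty"; next }
           $2 ~ /[^[:alnum:]@.]/ { print $1 ": has invalid chars" }'
}

# Made-up sample records; record "a" is clean and produces no output.
printf '%s\n' 'a,abc@xyz.com' 'b,abc-123@xyz.com' 'c,' 'd,_-()&(()%' | flag_invalid
# prints:
# b: has invalid chars
# c: empty
# d: has invalid chars
```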




Answer 4:


Your question would REALLY benefit from some concise, testable sample input and expected output as right now we're all guessing at what you want but maybe this does it?

awk -F, '{r=$2} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv

e.g. using the 2 input examples you provided:

$ cat file
_-()&(()%
abc-123@xyz,com

$ awk '{r=$0} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123@xyz,com has invalid chars
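
The command above relies on gsub() returning the number of replacements it made: a non-zero count means at least one allowed character was present, and a non-empty remainder means something disallowed was left over. A small sketch that makes both values visible (the function name show_leftover is made up for illustration):

```shell
# Hypothetical helper: strips the allowed characters from a copy of each
# line, then reports how many were removed and what is left over.
show_leftover() {
  awk '{ r = $0
         n = gsub(/[[:alnum:]@.]/, "", r)
         printf "%s: removed %d allowed, leftover \"%s\"\n", $0, n, r }'
}

printf '%s\n' '_-()&(()%' 'abc-123@xyz,com' | show_leftover
# prints:
# _-()&(()%: removed 0 allowed, leftover "_-()&(()%"
# abc-123@xyz,com: removed 13 allowed, leftover "-,"
```

Only the second line has both a non-zero count and a non-empty remainder, which is why it is the only one reported.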

There are more accurate email regexps btw, e.g.:

\<[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>

which is a gawk-specific (for word delimiters \< and \>) modification of the one described at http://www.regular-expressions.info/email.html after updating to use POSIX character classes.

If you are trying to validate email addresses do not use the regexp you started with as it will declare @ and 7 to each be valid email addresses.
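
As a rough sketch of whole-line validation with the stricter regexp, here is a variant that should also run in awks other than gawk. It is a modification, not the regexp above verbatim: it assumes each input line is exactly one address, so ^ and $ replace the gawk-only \< and \>, and [[:alpha:]][[:alpha:]]+ replaces {2,}. The function name check_email is hypothetical.

```shell
# Hypothetical helper: prints "valid" or "invalid" for each input line.
# Assumption: one address per line, so the pattern is anchored with ^ and $.
check_email() {
  awk '{ ok = ($0 ~ /^[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]][[:alpha:]]+$/)
         print $0 ": " (ok ? "valid" : "invalid") }'
}

printf '%s\n' 'ab@email.com' '@' '7' 'abc-123@xyz,com' | check_email
# prints:
# ab@email.com: valid
# @: invalid
# 7: invalid
# abc-123@xyz,com: invalid
```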

See also "How to validate an email address using a regular expression?" for more email regexp details.



Source: https://stackoverflow.com/questions/40046139/gawk-regex-to-find-any-record-having-characters-other-then-the-specified-by-char
