Question
I have a list of email addresses in a text file, and a pattern built from character classes that specifies which characters are allowed in an email address. From that input file, I want to find only the email addresses that contain characters other than the allowed ones. I am trying to write a gawk command for this but cannot get it to work properly. Here is what I am trying:
gawk -F "," ' $2!~/[[:alnum:]@\.]]/ { print "has invalid chars" }' emails.csv
The problem is that the above gawk command only matches records that contain NONE of the alphanumeric characters, @, or . (dot). What I am looking for is records that contain the allowed characters together with disallowed ones.
For example, the above command would find "_-()&(()%", since it contains only characters that are not in the regex pattern, but it will not find "abc-123@xyz,com", since that also contains characters from the character classes in the pattern.
Answer 1:
How about combining several tests: the field contains an alnum, an @, a dot, and an invalid character:
$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/
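To see how that combined condition behaves, here is a minimal sketch. It assumes a two-column CSV with the address in the second field; the sample rows and the function name flag_mixed are made up for illustration.

```shell
# flag_mixed is a hypothetical helper name; it reads CSV rows from stdin or from files.
flag_mixed() {
  # Print field 2 only when it has an alnum, an @, a dot, AND a disallowed character.
  awk -F, '$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/ { print $2, "has invalid chars" }' "$@"
}

# Made-up sample rows: id in field 1, address in field 2.
printf '%s\n' '1,abc-123@xyz.com' '2,abc@xyz.com' '3,_-()&(()%' | flag_mixed
# prints: abc-123@xyz.com has invalid chars
```

Row 2 is skipped because it has no disallowed character, and row 3 is skipped because it has no alnum, @, or dot.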
Answer 2:
Your regex is wrong here:
/[[:alnum:]@\.]]/
It should be:
/[[:alnum:]@.]/
Note the removal of the extra ] from the end.
Test Case:
# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]@.]]/)}' <<< 'abc,ab@email.com'
1
# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]@.]/)}' <<< 'abc,ab@email.com'
0
Answer 3:
Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is
$2 ~ /[^[:alnum:]@.]/{ print "has invalid chars" }
That won't trigger on an empty string, so you might want to add a test for that case.
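One way to add that empty-string test is sketched below. It assumes the address is in field 2 of a comma-separated file; the function name check_field, the sample rows, and the "empty address" message are hypothetical.

```shell
# check_field is a hypothetical helper name; it reads CSV rows from stdin or from files.
check_field() {
  awk -F, '
    $2 == ""              { print NR ": empty address"; next }   # report empty fields separately
    $2 ~ /[^[:alnum:]@.]/ { print NR ": has invalid chars" }     # any character outside the allowed set
  ' "$@"
}

# Made-up sample rows.
printf '%s\n' 'a,abc@xyz.com' 'b,' 'c,abc-123@xyz.com' | check_field
# prints:
# 2: empty address
# 3: has invalid chars
```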
Answer 4:
Your question would REALLY benefit from some concise, testable sample input and expected output, as right now we're all guessing at what you want, but maybe this does it?
awk -F, '{r=$2} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv
e.g. using the 2 input examples you provided:
$ cat file
_-()&(()%
abc-123@xyz,com
$ awk '{r=$0} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123@xyz,com has invalid chars
There are more accurate email regexps btw, e.g.:
\<[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>
which is a gawk-specific modification (for the word delimiters \< and \>) of the one described at http://www.regular-expressions.info/email.html, after updating it to use POSIX character classes.
If you are trying to validate email addresses, do not use the regexp you started with, as it will declare @ and 7 to each be valid email addresses.
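A quick sketch of why the allowed-characters check alone is not validation: it passes any non-empty string made up solely of allowed characters. The function name only_allowed and the sample inputs are made up for illustration.

```shell
# only_allowed is a hypothetical helper name; it prints lines that contain no disallowed character.
only_allowed() {
  awk '$0 !~ /[^[:alnum:]@.]/ { print $0, "passes" }' "$@"
}

printf '%s\n' '@' '7' 'abc@xyz.com' 'abc-123@xyz.com' | only_allowed
# prints:
# @ passes
# 7 passes
# abc@xyz.com passes
```

Both @ and 7 pass even though neither is an email address, so this test is only useful for spotting disallowed characters.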
See also "How to validate an email address using a regular expression?" for more email regexp details.
Source: https://stackoverflow.com/questions/40046139/gawk-regex-to-find-any-record-having-characters-other-then-the-specified-by-char