I am looking for Perl regular expression syntax to cover some requirements I have in a project. First, I want to exclude strings based on the words in a txt file (a dictionary).
For ex
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
    if ($string !~ /$exclude/) {
        say "OK: $string";
    }
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split on whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This can likely be tweaked and improved, depending on your use case; for one, with regard to the order of patterns with common parts in the alternation.†
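The remark about qr can be illustrated with a minimal sketch. It assumes the words are already in an array rather than read from a file; the word list and the helper name build_exclude_re are made up for illustration. qr compiles the alternation into a regex object that can be reused or embedded in a larger pattern.

```perl
use warnings;
use strict;
use feature qw(say);

# Hypothetical helper: build one compiled regex from a list of words.
# Assumes the list is not empty.
sub build_exclude_re {
    my @words = @_;
    my $alternation = join '|', map { quotemeta } @words;
    return qr/$alternation/;
}

# Made-up exclusion words, standing in for the contents of the file
my $exclude_re = build_exclude_re(qw(test orange ship));

foreach my $string (qw(a1testtre orangesh1 apleship3 banana7)) {
    say "OK: $string" if $string !~ $exclude_re;
}
```

With these made-up words only banana7 is printed, since each of the other strings contains one of the words.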
To check how often successive duplicate characters occur, and accept a string only when that happens fewer than three times (which is what the condition below tests):
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;
    if (@chars_that_repeat < 3) {
        say "OK: $string";
    }
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
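The difference between the two ways of counting can be seen side by side in a small sketch; the sub names count_runs and count_pairs are only illustrative and not part of the program above.

```perl
use warnings;
use strict;
use feature qw(say);

# Count maximal runs of a repeated character: aaaa is one run.
sub count_runs {
    my ($string) = @_;
    my @runs = $string =~ /(.)\1+/g;
    return scalar @runs;
}

# Count non-overlapping pairs: aaaa is two pairs.
sub count_pairs {
    my ($string) = @_;
    my @pairs = $string =~ /(.)\1/g;
    return scalar @pairs;
}

foreach my $string (qw(aaaa aaXaa kkeeykloakk)) {
    say "$string: runs=", count_runs($string), " pairs=", count_pairs($string);
}
# Prints:
#   aaaa: runs=1 pairs=2
#   aaXaa: runs=2 pairs=2
#   kkeeykloakk: runs=3 pairs=3
```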
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
† Consider an example with exclusion words so, sole, and solely. If you only need to check whether any one of these matches, then you'd want the shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). By all indications, that is the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. In that case you'd want the sort the other way round,
sort { length $b <=> length $a }
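To see the effect of putting longer words first, here is a minimal sketch that captures which word matched. The helper name matched_word and the test strings are assumptions for illustration, and a non-empty word list is assumed.

```perl
use warnings;
use strict;
use feature qw(say);

# Hypothetical helper: return the exclusion word matched in $string,
# trying longer words first, or undef if none matches.
sub matched_word {
    my ($string, @words) = @_;
    my $pattern = join '|', map { quotemeta }
                  sort { length $b <=> length $a } @words;
    return $string =~ /($pattern)/ ? $1 : undef;
}

my @words = qw(so sole solely);
foreach my $string (qw(solely1 console abc)) {
    my $word = matched_word($string, @words);
    say "$string: ", defined $word ? "matched '$word'" : "no match";
}
# Prints:
#   solely1: matched 'solely'
#   console: matched 'sole'
#   abc: no match
```

With the shorter words first, both solely1 and console would instead be reported as matching so.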