Perl Regular expression | how to exclude words from a file

夕颜 2021-01-21 17:47

I am looking for Perl regular expression syntax to cover some requirements I have in a project. First, I want to exclude strings that contain words from a text file (a dictionary).

For ex

5 answers

有刺的猬 (original poster)

2021-01-21 18:18
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
```
use warnings;
use strict;
use feature qw(say);

use Path::Tiny;

my $file = shift // die "Usage: $0 file\n";  #/

my @words = split ' ', path($file)->slurp;

my $exclude = join '|', map { quotemeta } @words;

foreach my $string (qw(a1testtre orangesh1 apleship3)) 
{ 
    if ($string !~ /$exclude/) { 
        say "OK: $string"; 
    }
}
```
I use `Path::Tiny` to read the file into a string ("slurp"), which is then split on whitespace into the words to use for exclusion. `quotemeta` escapes non-"word" characters, should any occur in your words, and the escaped words are then joined by `|` to form a string holding the regex pattern. (With complex patterns, use `qr`.)

This can likely be tweaked and improved depending on your use cases, for one with regard to the order of patterns with common parts in the alternation.^†
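As a rough sketch of the pattern-building step with a compiled `qr` pattern: the helper name `build_exclude_re`, the inline word list (standing in for the dictionary file), and the extra sample string `keycloak` are assumptions for illustration, not part of the original program. It also guards against an empty word list, which would otherwise produce an empty pattern.

```perl
use warnings;
use strict;
use feature qw(say);

# Hypothetical helper (not in the original answer): build one compiled
# regex from a list of words, or return undef if there are no words.
sub build_exclude_re {
    my @words = grep { length } @_;
    return unless @words;    # an empty pattern is not a useful exclusion test
    my $alternation = join '|', map { quotemeta } @words;
    return qr/$alternation/;
}

# Assumed word list, standing in for the contents of the dictionary file
my $exclude_re = build_exclude_re(qw(test orange ship));

foreach my $string (qw(a1testtre orangesh1 apleship3 keycloak)) {
    # With no exclusion words every string passes
    if (not defined $exclude_re or $string !~ $exclude_re) {
        say "OK: $string";
    }
}
```

With this assumed word list only `keycloak` is reported as OK, since each of the other strings contains one of the words.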

The check that runs of successive duplicate characters occur fewer than three times in a string:
```
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;

    if (@chars_that_repeat < 3) { 
        say "OK: $string";
    }
}
```
A long run of one repeated character (`aaaa`) counts as a single instance, due to the `+` quantifier in the regex; if you'd rather count all pairs, remove the `+`, and then four a's count as two pairs. The same character repeated at different places in the string counts every time, so `aaXaa` counts as two pairs.

This snippet can simply be added to the program above, which is invoked with the name of the file containing the words to use for exclusion. Both parts print what is expected for the provided samples.
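To make the difference between counting runs (`/(.)\1+/g`) and counting pairs (`/(.)\1/g`) concrete, here is a small sketch; the helper names `count_runs` and `count_pairs` and the short sample strings are made up for illustration.

```perl
use warnings;
use strict;
use feature qw(say);

# Hypothetical helpers: a global match in list context returns one
# captured character per match, so the array size is the number of matches.
sub count_runs  { my @m = $_[0] =~ /(.)\1+/g; return scalar @m }  # whole run = 1
sub count_pairs { my @m = $_[0] =~ /(.)\1/g;  return scalar @m }  # each pair = 1

foreach my $string (qw(aaaa aaXaa kkeeykloakk)) {
    say "$string: runs=", count_runs($string),
        " pairs=", count_pairs($string);
}
# aaaa: runs=1 pairs=2
# aaXaa: runs=2 pairs=2
# kkeeykloakk: runs=3 pairs=3
```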

^† Consider an example with the exclusion words `so`, `sole`, and `solely`. If you only need to check whether any one of these matches, then you'd want the shorter ones first in the alternation
```
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==>  so|sole|solely
```
for a quicker match (`so` matches wherever any of the three does). This certainly appears to be the case here.

But, if you wanted to correctly identify which word matched then you must have longer words first,
```
solely|sole|so
```
so that the string `solely` is correctly matched by its own word before it can be "stolen" by `so`. In that case you'd want the sort the other way round, `sort { length $b <=> length $a }`.
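The effect of the ordering can be seen by capturing the matched word under both sort orders. This is only a sketch: the helper `alternation` and its `'short'`/`'long'` argument are hypothetical names introduced here.

```perl
use warnings;
use strict;
use feature qw(say);

# Hypothetical helper: join words into an alternation, sorted by length,
# shortest first for 'short' and longest first for 'long'.
sub alternation {
    my ($order, @words) = @_;
    my $sign = $order eq 'long' ? -1 : 1;
    return join '|', map { quotemeta }
           sort { $sign * (length($a) <=> length($b)) } @words;
}

my @words = qw(sole so solely);
my $short_first = alternation('short', @words);   # so|sole|solely
my $long_first  = alternation('long',  @words);   # solely|sole|so

# Capture which alternative matched the string 'solely'
my ($hit_short) = 'solely' =~ /($short_first)/;
my ($hit_long)  = 'solely' =~ /($long_first)/;

say "shortest first matched: $hit_short";   # so
say "longest first matched:  $hit_long";    # solely
```

Either order answers "does any word match?" the same way; only the longest-first order reports the full word that was present.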