Perl regular expression | how to exclude words from a file


I am looking for Perl regular expression syntax to cover some requirements I have in a project. First, I want to exclude strings that appear in a txt file (a dictionary).

For ex


5 Answers

你的背包

2021-01-21 18:03

As mentioned in a comment on @zdim's answer, you can take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not carefully ordered to start with, I use a subroutine like this when building the match string:

# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST   
# TRUSTEES|TRUSTEE|TRUST

sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    @alts   = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        if (grep m#\Q$alt\E#, @alts) {   # \Q...\E: compare the word literally, not as a pattern
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}

So following @zdim's answer:

...
my @words = split ' ', path($file)->slurp;

@words = tight_match_order(@words); # add this line

my $exclude = join '|', map { quotemeta } @words;
...
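To see why the order matters, here is a small sketch. The sample string and the `strip_words` helper are made up for illustration and are not part of either answer. Perl tries alternatives left to right and takes the first one that matches at a given position, so a shorter word listed first can win and leave a fragment behind:

```perl
use strict;
use warnings;

# Hypothetical helper: delete every listed word from $text,
# trying the words in exactly the order given.
sub strip_words {
    my ($text, @words) = @_;
    my $pattern = join '|', map { quotemeta } @words;
    $text =~ s/(?:$pattern)//g;
    return $text;
}

my $text = 'TRUSTEES meet';

# Loose order: TRUST matches first, so "EES" is left behind.
print '[', strip_words($text, qw(TRUST TRUSTEE TRUSTEES)), "]\n";   # [EES meet]

# Tight order: the whole word is removed.
print '[', strip_words($text, qw(TRUSTEES TRUSTEE TRUST)), "]\n";   # [ meet]
```

For a plain yes/no exclusion test the order does not change the answer, only which alternative ends up matching.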

HTH


一整个雨季

2021-01-21 18:05
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
```
^(?!.*(?:tree|car|ship)).*$
```
- ^ Assert start of string
- (?! Negative lookahead, assert what is on the right is not
  - .*(?:tree|car|ship) Match any char except a newline 0+ times, then match either tree, car, or ship
- ) Close negative lookahead
- .* Match any char except a newline 0+ times
- $ Assert end of string
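In Perl, the pattern above could be built from a word list roughly like this. It is only a sketch: the hard-coded `@words` stands in for the words read from your file, and `is_allowed` is a name I made up.

```perl
use strict;
use warnings;

my @words = qw(tree car ship);    # stand-in for the words read from the file

# Escape each word, join with |, and wrap in the negative lookahead.
my $alternation = join '|', map { quotemeta } @words;
my $allowed_re  = qr/^(?!.*(?:$alternation)).*$/;

# Hypothetical helper: true when the string contains none of the words.
sub is_allowed {
    my ($string) = @_;
    return $string =~ $allowed_re ? 1 : 0;
}

for my $string (qw(a1testtre orangesh1 apleship3)) {
    print "$string: ", (is_allowed($string) ? 'accepted' : 'rejected'), "\n";
}
# a1testtre: accepted
# orangesh1: accepted
# apleship3: rejected
```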

To reject a word that contains three or more doubled characters (pairs), you could use:
```
\b(?!(?:\w*(\w)\1){3})\w+\b
```
- \b Word boundary
- (?! Negative lookahead, assert what is on the right is not
  - (?: Non-capturing group
  - \w*(\w)\1 Match 0+ word characters, then capture a word char in group 1, followed by a backreference \1 to that group
  - ){3} Close non-capturing group and repeat 3 times
- ) Close negative lookahead
- \w+ Match 1+ word characters
- \b word boundary
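A quick way to try this pattern from Perl, as a sketch (the helper name `passes_pair_check` is mine; the sample words are the ones used elsewhere on this page):

```perl
use strict;
use warnings;

# Matches a word only if it has fewer than three doubled characters.
my $pair_check = qr/\b(?!(?:\w*(\w)\1){3})\w+\b/;

# Hypothetical helper wrapping the match as a yes/no test.
sub passes_pair_check {
    my ($word) = @_;
    return $word =~ $pair_check ? 1 : 0;
}

for my $word (qw(adminnisstrator21 kkeeykloakk stack22ooverflow)) {
    print "$word: ", (passes_pair_check($word) ? 'accepted' : 'rejected'), "\n";
}
# adminnisstrator21: accepted   (nn, ss)
# kkeeykloakk: rejected         (kk, ee, kk)
# stack22ooverflow: accepted    (22, oo)
```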

Update

According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
```
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
```
Taken together, those 2 patterns contain 2 capturing groups, so the backreference in the second pattern has to point to the second capturing group, \2.
```
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
                                               ^  
```
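Capture groups are numbered by their opening parenthesis across the whole pattern, which is why the `(.)` in the second lookahead becomes group 2. A small sketch to try the corrected pattern in Perl follows; `passes_both` is a made-up helper name and `abababababa` is a sample I added to exercise the second check.

```perl
use strict;
use warnings;

# Combined pattern with the second backreference pointing at group 2.
my $both_checks = qr/(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)/;

# Hypothetical helper: true when the string passes both lookaheads.
sub passes_both {
    my ($string) = @_;
    return $string =~ $both_checks ? 1 : 0;
}

# abababababa has no doubled chars, but "a" occurs 6 times.
for my $string (qw(adminnisstrator21 kkeeykloakk abababababa)) {
    print "$string: ", (passes_both($string) ? 'accepted' : 'rejected'), "\n";
}
# adminnisstrator21: accepted
# kkeeykloakk: rejected     (three pairs, first check)
# abababababa: rejected     (same char more than 4 times, second check)
```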
有刺的猬

2021-01-21 18:18
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
```
use warnings;
use strict;
use feature qw(say);

use Path::Tiny;

my $file = shift // die "Usage: $0 file\n";  #/

my @words = split ' ', path($file)->slurp;

my $exclude = join '|', map { quotemeta } @words;

foreach my $string (qw(a1testtre orangesh1 apleship3)) 
{ 
    if ($string !~ /$exclude/) { 
        say "OK: $string"; 
    }
}
```
I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any happen in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
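As a sketch of the qr variant: the pattern is compiled once and can then be reused or embedded in a larger regex. The word list here is hard-coded for illustration (with `c++` added to show what quotemeta is for), and `contains_excluded` is a name made up for this example.

```perl
use strict;
use warnings;

my @words = qw(tree car ship c++);   # sample list; in the program it comes from the file

# quotemeta turns "c++" into "c\+\+", so the + signs are matched literally.
my $alternation = join '|', map { quotemeta } @words;
my $exclude_re  = qr/(?:$alternation)/;

# Hypothetical helper: true when the string contains any excluded word.
sub contains_excluded {
    my ($string) = @_;
    return $string =~ $exclude_re ? 1 : 0;
}

for my $string (qw(a1testtre apleship3 uses_c++_daily)) {
    print "$string: ", (contains_excluded($string) ? 'excluded' : 'OK'), "\n";
}
# a1testtre: OK
# apleship3: excluded
# uses_c++_daily: excluded
```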

This can probably be tweaked and improved, depending on your use cases, for one with regard to the order of patterns with common parts in the alternation.†

A check that runs of repeated characters occur fewer than three times (the code below accepts a string with at most two such runs):
```
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;

    if (@chars_that_repeat < 3) { 
        say "OK: $string";
    }
}
```
A long run of one repeated char (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
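The difference between the two ways of counting can be seen with a short sketch (the helper names `count_runs` and `count_pairs` are mine):

```perl
use strict;
use warnings;

# With +, a whole run of the same char is one match.
sub count_runs {
    my ($string) = @_;
    my @runs = $string =~ /(.)\1+/g;
    return scalar @runs;
}

# Without +, every non-overlapping pair is its own match.
sub count_pairs {
    my ($string) = @_;
    my @pairs = $string =~ /(.)\1/g;
    return scalar @pairs;
}

for my $string (qw(aaaa aaXaa kkeeykloakk)) {
    print "$string: runs=", count_runs($string), " pairs=", count_pairs($string), "\n";
}
# aaaa: runs=1 pairs=2
# aaXaa: runs=2 pairs=2
# kkeeykloakk: runs=3 pairs=3
```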

This snippet can simply be added to the above program, which is invoked with the name of the file that holds the words to use for exclusion. Both parts print what is expected for the provided samples.

† Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
```
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==>  so|sole|solely
```
for a quicker match (so matches all three). This, by all means, appears to be the case here.

But, if you wanted to correctly identify which word matched then you must have longer words first,
```
solely|sole|so
```
so that a string solely is correctly matched by its word before it can be "stolen" by so. In this case you'd want the sort the other way round: sort { length $b <=> length $a }.
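Both orderings side by side, as a sketch (the `first_hit` helper is a name made up here; it just returns which alternative matched):

```perl
use strict;
use warnings;

my @words = qw(sole so solely);

# Shorter words first, then longer words first.
my $short_first = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
my $long_first  = join '|', map { quotemeta } sort { length $b <=> length $a } @words;

# Hypothetical helper: return the alternative that matched, or undef.
sub first_hit {
    my ($string, $pattern) = @_;
    my ($hit) = $string =~ /($pattern)/;
    return $hit;
}

print "$short_first => ", first_hit('solely', $short_first), "\n";   # so|sole|solely => so
print "$long_first => ",  first_hit('solely', $long_first),  "\n";   # solely|sole|so => solely
```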

傲寒

2021-01-21 18:23

I hope someone else will come up with a better solution, but this seems to do what you want:

\b                          Match a word boundary
  (?:                       Start non-capturing group
    (?:([a-z0-9])(?!\1))*   Match characters until a doubled one is encountered
    (?:([a-z0-9])\2)+       Match doubled characters until a different one is reached
  ){0,2}                    Repeat the group 0 to 2 times
  (?:([a-z0-9])(?!\3))+     Match the remaining characters until a doubled one is encountered
\b                          Match a word boundary (end of word)

I changed the [a-z] to also match digits, since the examples you gave seem to include them. Perl regex also has the \w shorthand, which for ASCII text is equivalent to [A-Za-z0-9_] and could be handy if you want to match any character in a word.
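If you want to keep the explanation next to the pattern, Perl's /x modifier lets you write it spread out with comments. This is only a sketch of the same pattern as above; the `matches_word` helper and the sample words (borrowed from the other answers) are my own choice.

```perl
use strict;
use warnings;

# Same pattern as above, written with /x so whitespace and comments are ignored.
my $word_re = qr{
    \b
    (?:
        (?: ([a-z0-9]) (?!\1) )*    # single characters, up to a doubled one
        (?: ([a-z0-9]) \2 )+        # one or more doubled characters
    ){0,2}                          # the group may occur 0 to 2 times
    (?: ([a-z0-9]) (?!\3) )+        # remaining single characters
    \b
}x;

# Hypothetical helper wrapping the match as a yes/no test.
sub matches_word {
    my ($word) = @_;
    return $word =~ $word_re ? 1 : 0;
}

for my $word (qw(adminnisstrator21 kkeeykloakk stack22ooverflow)) {
    print "$word: ", (matches_word($word) ? 'match' : 'no match'), "\n";
}
# adminnisstrator21: match
# kkeeykloakk: no match
# stack22ooverflow: match
```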


感动是毒

2021-01-21 18:25
My problem is that I have 2 regexes that each work on their own:

Do not allow more than 3 pairs of chars:
```
(?=^(?!(?:\w*(.)\1){3}).+$)
```
Do not allow a char to repeat more than 4 times:
```
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
```
Now I want to combine them into one line like:
```
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
```
but only the regex that comes first works, not both of them.