Regular expression to match boundary between different Unicode scripts

一个人的身影 2021-01-12 02:24

Regular expression engines have a concept of "zero width" matches, some of which are useful for finding edges of words:

  • \b - present in most en

1 Answer
  • 2021-01-12 02:36

    EDIT: I just noticed you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, since the needed machinations are likely to be really tough in any other language. Plus if you’re doing pattern matching with Unicode, Perl really is the best choice available for that particular kind of work.


    When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

    my $data = "foo1 and Πππ 語語語 done";
    
    while ($data =~ /($rx)/g) {
       print "Got string: '$1'\n"; 
    } 
    

    Generates this output:

    Got string: 'foo1 and '
    Got string: 'Πππ '
    Got string: '語語語 '
    Got string: 'done'
    

    That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. This is pretty darned close to what I think you actually need.

    The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.

    My solution uses lexical variables inside of a (??{...}) construct. Turns out that that is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.

    First, I decided that you didn’t actually want a run of all the same script type; you wanted a run of all the same script type plus Common and Inherited. Otherwise you will get messed up by punctuation and whitespace and digits for Common, and by combining characters for Inherited. I really don’t think you want those to interrupt your run of “all the same script”, but if you do, it’s easy to stop considering those.
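
    To see which characters fall into those two catch-all scripts, here is a small sketch using charscript from Unicode::UCD, the same function the full program relies on. The helper name script_of is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper: return the Unicode Script property value
# for a one-character string.
sub script_of {
    my ($char) = @_;
    return charscript(ord $char);
}

# A Latin letter, a digit, a space, a combining acute accent (U+0301),
# a Greek letter, and a Han character.
for my $char ("f", "1", " ", "\x{301}", "Π", "語") {
    printf "U+%04X => %s\n", ord $char, script_of($char);
}
```

    The digit and the space should report Common and the combining accent should report Inherited, which is why they are allowed to ride along inside a run of any other script.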

    So what we do is look ahead for the first character whose script type is something other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern that matches any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved off. Then we evaluate that new pattern and continue.

    Hey, I said it was hairy, didn’t I?
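
    If the steps are easier to follow outside a regex, here is a rough procedural sketch of the same idea: walk the string one character at a time and start a new run whenever a character belongs to a script that is neither Common, Inherited, nor the script of the current run. This is not the single-pattern solution, and script_runs is a hypothetical helper name. It also differs at one edge: a string made up only of Common or Inherited characters comes back as a single run, whereas the pattern would not match it at all.

```perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::UCD qw(charscript);

# Hypothetical helper: split $text into runs in which every character is
# Common, Inherited, or of one single other script.
sub script_runs {
    my ($text) = @_;
    my @runs;
    my $current = "";
    my $run_script;    # script of the current run; undef until one is seen

    for my $char (split //, $text) {
        # Guard against an undefined result for unusual code points.
        my $script = charscript(ord $char) // "Unknown";

        if ($script ne "Common" && $script ne "Inherited") {
            # A different "real" script closes the run collected so far.
            if (defined $run_script && $script ne $run_script) {
                push @runs, $current;
                $current = "";
            }
            $run_script = $script;
        }
        $current .= $char;
    }
    push @runs, $current if length $current;
    return @runs;
}

print "Got string: '$_'\n" for script_runs("foo1 and Πππ 語語語 done");
```

    Because trailing spaces and digits are Common, they stay attached to the run that precedes them, just as they do with the pattern.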

    In the program I’m about to show, I’ve left in some commented-out debugging statements that show just what it’s doing. If you uncomment them, you get this output for the last run, which should help you understand the approach:

    DEBUG: Got peekahead character f, U+0066
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    DEBUG: Scriptname is Greek
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
    Got string: 'Πππ '
    DEBUG: Got peekahead character 語, U+8a9e
    DEBUG: Scriptname is Han
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
    Got string: '語語語 '
    DEBUG: Got peekahead character d, U+0064
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'done'
    

    And here at last is the big hairy deal:

    use v5.17.1;
    use strict;
    use warnings;
    use warnings FATAL => "utf8";
    use open qw(:std :utf8);
    use utf8;
    
    use Unicode::UCD qw(charscript);
    
    # regex to match a string that's all of the
    # same Script=XXX type
    #
    my $rx = qr{
        (?=
           [\p{Script=Common}\p{Script=Inherited}] *
            (?<CAPTURE>
                [^\p{Script=Common}\p{Script=Inherited}]
            )
        )
        (??{
            my $capture = $+{CAPTURE};
       #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
            my $scriptname = charscript(ord $capture);
       #####print "DEBUG: Scriptname is $scriptname\n";
            my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                    . $scriptname
                    . q(}]*);
       #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
            $run;
        })
    }x;
    
    
    my $data = "foo1 and Πππ 語語語 done";
    
    $| = 1;
    
    while ($data =~ /($rx)/g) {
       print "Got string: '$1'\n";
    }
    

    Yeah, there oughta be a better way. I don’t think there is, yet.

    So for now, enjoy.
