Why won't a longer token in an alternation be matched?

穿精又带淫゛_ 提交于 2019-12-17 21:59:23

问题


I am using ruby 2.1, but the same thing can be replicated on rubular site.

If this is my string:

儘管中國婦幼衛生監測辦公室制定的

And I do a regex match with this expression:

(中國婦幼衛生監測辦公室制定|管中)

I am expecting to get the longer token as a match.

中國婦幼衛生監測辦公室制定

Instead I get the second alternation as a match.

As far as I know it does work like that when not in chinese characters.

If this is my string:

foobar

And I use this regex:

(foobar|foo)

Returned matching result is foobar. If the order is in the other way, than the matching string is foo. That makes sense to me.


回答1:


Your assumption that regex matches a longer alternation is incorrect.

If you have a bit of time, let's look at how your regex works...

Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.

There are two pointers, one on the Pattern:

(cdefghijkl|bcd)

The other on your String:

abcdefghijklmnopqrstuvw

The pointer on the String moves from the left. As soon as it can return, it will:


(source: gyazo.com)

Let's turn that into a more "sequential" sequence for understanding:


(source: gyazo.com)

Your foobar example is a different topic. As I mentioned in this post:

How regex works: The state machine always reads from left to right. ,|,, == ,, as it always will only be matched to the first alternation.

    That's good, Unihedron, but how do I force it to the first alternation?

Look!*

^(?:.*?\Kcdefghijkl|.*?\Kbcd)

Here have a regex demo.

This regex first attempts to match the entire string with the first alternation. Only if it fails completely will it then attempt to match the second alternation. \K is used here to keep the match with the contents behind the construct \K.


*: \K was supported in Ruby since 2.0.0.

Read more:

  • The Stack Overflow Regex Reference
  • On greedy vs non-greedy





Ah, I was bored, so I optimized the regex:

^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)

You can see a demo here.



来源:https://stackoverflow.com/questions/25511528/why-wont-a-longer-token-in-an-alternation-be-matched

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!