How to optionally add a comma and whitespace to a capture group?

不羁岁月 submitted on 2019-12-02 04:37:17

Here it is...

/^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +([\S\s]+?)\n\x{2022} +([\S\s]+?)\n\d+ \| [-\dn]+\s*/gum

Demo Link

I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.

  • Starting anchor ^ is used to identify start of each block (Efficiency / Accuracy)
  • \d is used instead of [0-9] (Brevity)
  • \s is replaced with a literal space where applicable (Brevity)
  • A character class of specific letters and parentheses is used in place of \w for capture group 3 (Efficiency). It could be replaced with [\w()] for brevity, at some cost in efficiency.
  • The bullet is specified with the escape sequence \x{2022} (Personal preference)
  • A character class, [-\dn], is used for the trailing characters of each block (Efficiency / Accuracy)
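The part of the pattern that answers the title question is capture group 3: one run of allowed characters, followed by zero or more repetitions of "comma, optional space, another run". Below is a minimal sketch of that sub-pattern in isolation, written for Python's re module rather than PCRE. The helper name match_types and the sample strings in the usage note are hypothetical and do not come from the original data.

```python
import re

# Character class copied from capture group 3 of the pattern above.
TYPE_CHARS = r"[acdefijlmnoprtv()]+"

# One run of type characters, then zero or more repetitions of
# "comma, optional single space, another run". The non-capturing
# group (?:...) keeps the whole list inside a single capture group.
TYPE_LIST = re.compile(rf"({TYPE_CHARS}(?:, ?{TYPE_CHARS})*)")


def match_types(text):
    """Return the captured type list if the whole string is one, else None.

    Hypothetical helper, for illustration only.
    """
    m = TYPE_LIST.fullmatch(text)
    return m.group(1) if m else None
```

With this sketch, match_types("adj"), match_types("adj, n") and match_types("adj,n") should each return their input unchanged, while match_types("adj n") and match_types("adj,") should return None, because a comma must be followed by another run and a space alone is not a separator.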

When you have to describe a long string with many parts, a good first reflex is to use free-spacing mode (the x modifier) and named groups. Named groups aren't very useful in a replacement context, but they make the pattern more readable and easier to debug:

~^
(?<No> [0-9]+ )  \h+
(?<word> \pL+ )  \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* )  \h+
(?<wd_tr> [^•]* [^•\s] )  \h* \R

• \h*
(?<sent_fr> [^–]* [^\s–] )   \s* – \s*
(?<sent_eng> .* (?:\R .*)*? )  \h* \R

(?<num1> [0-9]+ )  \h* \| \h*
(?<num2> .*\S )
~xum

demo
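The same free-spacing, named-group style can be sketched in Python, with some caveats. Python's re module does not support \h, \R or \pL, so this adaptation substitutes [ \t] for \h, $ under re.MULTILINE for the line break, and [^\W\d_] for \pL. It also uses [\w()] for the type list, which is looser than [\pL()]. Only the first line of a block is covered, and the sample entry is hypothetical, not taken from the original data.

```python
import re

# Header line of one block: number, word, comma-separated types, translation.
# re.VERBOSE ignores whitespace in the pattern except inside character
# classes, so "[ \t]" still means "space or tab".
HEADER = re.compile(
    r"""
    ^
    (?P<no>    [0-9]+ )                           [ \t]+
    (?P<word>  [^\W\d_]+ )                        [ \t]+
    (?P<type>  [\w()]+ (?: , [ \t]* [\w()]+ )* )  [ \t]+
    (?P<wd_tr> [^•\n]* [^•\s] )                   [ \t]* $
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample entry, for illustration only.
sample = "12 maison nf, adj house, home\n• La maison est grande.\n"

m = HEADER.search(sample)
fields = m.groupdict() if m else {}
```

Because each part has a name, a failing match can be debugged by removing the named groups one at a time from the end until the pattern matches again.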

There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning, then add flexibility when you encounter cases that don't match.
