Regex includes two matches in first match

Question

I have this regex that tries to find individual STEP lines and divide each one into three groups: reference number, class, and properties:

#14=IFCEXTRUDEDAREASOLID(#28326,#17,#9,3657.6);

becomes

[['14'], ['IFCEXTRUDEDAREASOLID'], ['#28326,#17,#9,3657.6']]

Sometimes these lines have arbitrary line breaks, especially among the properties, so I put some \s in the regex. This, however, causes an interesting bug: each match now spans TWO rows.

How can I adjust the regex to catch only one row, even if the row contains line breaks? And just out of curiosity, why does it stop after the second line instead of continuing until the last line?

Answer 1:

You now match two lines every time because \s matches any whitespace, including line breaks. If a line break follows a matched line, the \s* will consume it along with any other whitespace there.

Use

/^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);/gm

Details:

^ - start of a line
# - a hash symbol
(\d+) - Group 1: one or more digits
\s*=\s* - a = surrounded by optional whitespace
([a-zA-Z0-9]+) - Group 2 capturing 1+ alphanumerics
\s*\( - 0+ whitespaces and a (
((?:'[^']*'|[^;'])+) - Group 3 capturing one or more repetitions of either a '...' substring ('[^']*', with no ' allowed inside) or (|) a single char other than ; and ' ([^;'])
\); - a ); sequence
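
As a minimal sketch, the pattern above can be tried in Python. The thread does not name a language, so Python is an assumption here, with re.MULTILINE standing in for the m flag; the g flag corresponds to iterating with finditer. The sample text and the name STEP_LINE are made up for illustration.

```python
import re

# The pattern from this answer; re.MULTILINE lets ^ match at every line start.
STEP_LINE = re.compile(
    r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);",
    re.MULTILINE,
)

# Hypothetical sample: the second record is wrapped across two lines.
sample = (
    "#14=IFCEXTRUDEDAREASOLID(#28326,#17,#9,3657.6);\n"
    "#15=IFCCARTESIANPOINT((0.,\n"
    "  0.,0.));\n"
)

# One list of [reference number, class, properties] per record.
records = [list(m.groups()) for m in STEP_LINE.finditer(sample)]

for record in records:
    print(record)
```

Each record yields its own match, and the line break inside the second record stays within its properties group instead of pulling in a neighbouring row.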

A negated character class solution suggested by Maverick_Mrt is good for specific cases, but once the text that was captured with ([\s\S]*?) contains the negated char, the match will fail.
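
The limitation can be illustrated with a quoted string that contains a semicolon. This is a sketch in Python (an assumed language); the sample line and the names NEGATED_CLASS and QUOTE_AWARE are hypothetical, and the negated-class pattern is a simplified variant with a capturing group added for comparison.

```python
import re

# Negated character class variant: the properties may not contain any ';'.
NEGATED_CLASS = re.compile(
    r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([^;]*)\);", re.MULTILINE
)

# Quote-aware pattern from this answer: ';' is allowed inside '...'.
QUOTE_AWARE = re.compile(
    r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);", re.MULTILINE
)

# Hypothetical record whose quoted string contains a semicolon.
line = "#30=IFCMATERIAL('Steel; galvanized');"

negated_match = NEGATED_CLASS.search(line)  # stops at the ';' inside the quotes
quoted_match = QUOTE_AWARE.search(line)     # treats the quoted string as one unit

print(negated_match)
print(quoted_match.group(3))
```

The negated class finds no match because the first ; it meets is not preceded by ), while the quote-aware alternation skips over the whole quoted substring first.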

Answer 2:

You can try this:

#(\d+)\s*=\s*([a-z0-9]+)\s*\([^;]*\);
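
Two things are worth checking before using this pattern as written. The class [a-z0-9] covers lowercase letters only, so it presumably relies on a case-insensitive flag (an assumption, since the flags are not shown here), and the properties are matched but not captured. A minimal Python sketch (an assumed language), with a hypothetical WITH_GROUP3 variant that adds the third group:

```python
import re

# The pattern exactly as given in this answer.
ANSWER2 = r"#(\d+)\s*=\s*([a-z0-9]+)\s*\([^;]*\);"

# Assumed variant: parentheses added around [^;]* to capture the properties.
WITH_GROUP3 = r"#(\d+)\s*=\s*([a-z0-9]+)\s*\(([^;]*)\);"

# The record from the question, wrapped across two lines for illustration.
line = "#14=IFCEXTRUDEDAREASOLID(#28326,\n#17,#9,3657.6);"

case_sensitive = re.search(ANSWER2, line)                    # uppercase class name
case_insensitive = re.search(ANSWER2, line, re.IGNORECASE)   # two groups only
three_groups = re.search(WITH_GROUP3, line, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive.groups())
print(three_groups.groups())
```

Because [^;]* also matches line breaks, the wrapped record is still found as a single match once the case-insensitive flag is set.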

Source: https://stackoverflow.com/questions/41715407/regex-includes-two-matches-in-first-match

Tags

regex

step

ifc