.net regex - strings that don't contain full stop on last list item

Submitted by 馋奶兔 on 2020-02-06 07:56:10

Question


I'm trying to use a .NET regex to identify strings in XML data that don't contain a full stop before the last closing tag. I don't have much experience with regex, and I'm not sure what I need to change, or why, to get the result I'm looking for.

There are line breaks and carriage returns at the end of each line in the data.

A schema is used for the XML.

Example of good XML Data:

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc.</item>
</randlist>

Example of bad XML data (the regex should match here, because there is no full stop preceding the last </item>):

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc</item>
</randlist>

Regex pattern I tried, which didn't match the bad XML data (not tested on the good XML data):

^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$

Results using http://regexstorm.net/tester:

0 matches

Results using https://regex101.com/:

0 matches

In my opinion, this question differs from the following one because of the full stop and start-of-string criteria:

Regex for string not ending with given suffix

Explanation from regex101 (https://regex101.com/):

/^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$/gm
^ asserts position at start of a line
<randlist  matches the characters <randlist  literally (case sensitive)
\w* matches any word character (equal to [a-zA-Z0-9_])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
= matches the character = literally (case sensitive)
Match a single character present in the list below [\S\s]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\.* matches the character . literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^.]
. matches the character . literally (case sensitive)
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
item> matches the characters item> literally (case sensitive)
Match a single character present in the list below [\n]*
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
randlist> matches the characters randlist> literally (case sensitive)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

Answer 1:


@Silvanas is absolutely correct. You should not use regex for this problem; you should use some form of XML parser to read the data and find the lines with ".". However, if for some horrible reason you MUST use regex, and if your data is structured exactly like your example, then the regex solution would be the following:
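For comparison, the parser route recommended above can be sketched in a few lines. This is only an illustration in Python's standard xml.etree.ElementTree, not the .NET API; the function name last_items_missing_stop and the sample strings are made up for the example. It assumes the <randlist> and <item> elements carry no XML namespace and that each item holds plain text only.

```python
import xml.etree.ElementTree as ET


def last_items_missing_stop(xml_text):
    """Return the text of each <randlist>'s last <item> that lacks a trailing full stop."""
    root = ET.fromstring(xml_text)
    offenders = []
    # iter() also visits the root element itself, so a bare <randlist> document works.
    for randlist in root.iter("randlist"):
        items = randlist.findall("item")
        if not items:
            continue  # an empty list has no last item to check
        last_text = (items[-1].text or "").rstrip()
        if not last_text.endswith("."):
            offenders.append(last_text)
    return offenders


# Sample data modeled on the question, with CRLF line endings (assumed).
GOOD = (
    '<randlist prefix="unorder">\r\n'
    "    <item>abc</item>\r\n"
    "    <item>abc</item>\r\n"
    "    <item>abc.</item>\r\n"
    "</randlist>"
)
BAD = GOOD.replace("abc.", "abc")

print(last_items_missing_stop(GOOD))  # []
print(last_items_missing_stop(BAD))   # ['abc']
```

Because the parser works on elements rather than lines, indentation, attribute order, and line endings do not affect the result.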

^\s+<item>[^<]*?(?<=\.)<\/item>$

If there ARE any matches with that regex, your XML is malformed. But again, this regex fails if the whitespace isn't correct, if there's anything else on the line, if the tags aren't <item>..</item>, and so on and so on. Again, you would be far, far better off not using regex for this problem unless you can absolutely guarantee that everything but the . is going to be well-formed XML.
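Note that (?<=\.) is a positive lookbehind, so the pattern above matches items that DO end in a full stop. To flag the case described in the question (the last item lacks one), the usual substitute is a negative lookbehind, (?<!\.). The sketch below is a hypothetical variant, not the answerer's pattern. It is written with Python's re module as a stand-in for .NET, since both support these lookarounds. The lookahead (?=\s*</randlist>) is an added assumption that limits the match to the last item, and the trailing $ is dropped because the lookahead already fixes what follows.

```python
import re

# Hypothetical variant of the answer's first pattern:
#   (?<!\.)              the character before </item> must NOT be a full stop
#   (?=\s*</randlist>)   only whitespace separates this item from </randlist>
LAST_ITEM_WITHOUT_STOP = re.compile(
    r"^\s+<item>[^<]*(?<!\.)</item>(?=\s*</randlist>)",
    re.MULTILINE,  # Python counterpart of RegexOptions.Multiline
)


def find_offenders(xml_text):
    """Return every last <item> line that does not end in a full stop."""
    return [m.group().strip() for m in LAST_ITEM_WITHOUT_STOP.finditer(xml_text)]


# Sample data modeled on the question, with CRLF line endings (assumed).
GOOD = (
    '<randlist prefix="unorder">\r\n'
    "    <item>abc</item>\r\n"
    "    <item>abc</item>\r\n"
    "    <item>abc.</item>\r\n"
    "</randlist>"
)
BAD = GOOD.replace("abc.", "abc")

print(find_offenders(GOOD))  # []
print(find_offenders(BAD))   # ['<item>abc</item>']
```

The same caveats apply as for the original pattern: it depends on the exact layout of the data and is no replacement for a parser.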

EDIT: If the opening and closing tag are on the same line, but it isn't necessarily titled 'item', and may have attributes, go ahead and try the following:

^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>$

Breakdown:
^           anchor to beginning of line
\s+         skip over any whitespace
<           found what looks like an opening tag
([^<>\s]+)  match the first word found after the "<", store in capture group 1
[^<>]*>     match whatever remains until the closing ">"
[^<>]*?     match all of the contents up until the next "<"
(?<=\.)     ensure the last character was a "."
<\/\1>      match a closing tag where the text after the / is the same as the first word of the opening tag (stored in capture group 1)
$           anchor to end of line

Make sure you have the Multiline regex option set (RegexOptions.Multiline); otherwise ^ and $ will match only the beginning/end of the entire string. As before, any matches with this regex mean the XML is poorly formed on that line.
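One more thing to watch: the question says each line ends with a carriage return plus line feed. In multiline mode, $ matches just before \n, so a \r sitting in front of it blocks the match unless the pattern allows for it, typically with \r?$. The sketch below illustrates both effects with Python's re module as a stand-in for .NET; the names AS_WRITTEN, CRLF_TOLERANT and count_matches, and the sample data, are made up for the example.

```python
import re

# The answer's generalized pattern, unchanged.
AS_WRITTEN = r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>$"
# Hypothetical CRLF-tolerant variant: \r? consumes an optional carriage return before $.
CRLF_TOLERANT = r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>\r?$"

# One item ends in a full stop, so the pattern has exactly one line to find.
CRLF_DATA = (
    '<randlist prefix="unorder">\r\n'
    "    <item>abc</item>\r\n"
    '    <item kind="x">abc.</item>\r\n'
    "</randlist>"
)
LF_DATA = CRLF_DATA.replace("\r\n", "\n")


def count_matches(pattern, text, flags=0):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text, flags))


print(count_matches(AS_WRITTEN, LF_DATA))                     # 0: ^ only matches at string start
print(count_matches(AS_WRITTEN, LF_DATA, re.MULTILINE))       # 1
print(count_matches(AS_WRITTEN, CRLF_DATA, re.MULTILINE))     # 0: \r sits between </item> and $
print(count_matches(CRLF_TOLERANT, CRLF_DATA, re.MULTILINE))  # 1
```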



Source: https://stackoverflow.com/questions/59846649/net-regex-strings-that-dont-contain-full-stop-on-last-list-item
