Question
So I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus).
I am starting with a string like this:
Arms_NNS folded_VVN ,_,
The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and any others with the same tag sequence (NNS or NN, followed by VVN or VBG, followed by a comma).
The following regex is what I am trying to use, but it is not finding anything:
[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,
Answer 1:
Given the input string
Arms_NNS folded_VVN ,_,
the following regex
(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)
matches the whole string (and captures it - if you don't know what that means, that probably means it doesn't matter to you).
Given a longer string (which I made up)
Dog_NN Arms_NNS folded_VVN ,_, burp_VV
it still matches the part you want.
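As a quick check, here is a minimal sketch that runs the pattern against both sample strings. It assumes Python's re module, which is my choice for illustration; the question does not say which regex engine the corpus tool uses.

```python
import re

# Pattern from above: a noun (NN or NNS), then a verb (VBG or VVN),
# then the tagged comma, each separated by exactly one space.
PATTERN = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")

short_text = "Arms_NNS folded_VVN ,_,"
long_text = "Dog_NN Arms_NNS folded_VVN ,_, burp_VV"

# search() scans the string, so the match may start anywhere in it.
short_match = PATTERN.search(short_text)
long_match = PATTERN.search(long_text)

print(short_match.group(1))  # Arms_NNS folded_VVN ,_,
print(long_match.group(1))   # Arms_NNS folded_VVN ,_,
```

In the longer string, Dog_NN and burp_VV are left out because they are not part of the noun, verb, comma sequence.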
If the _VVN part is optional, you can use
(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)
which matches with or without a single word_VVN / word_VBG part.
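A small sketch of the optional version, again assuming Python's re module. The second and third test strings are made up for illustration.

```python
import re

# The (?: ... )? wrapper makes the verb token, and its trailing space, optional.
OPTIONAL = re.compile(r"(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)")

with_verb = OPTIONAL.search("Arms_NNS folded_VVN ,_,")
without_verb = OPTIONAL.search("Arms_NNS ,_,")
# Two verb tokens in a row: ? allows at most one, so nothing matches.
two_verbs = OPTIONAL.search("Arms_NNS folded_VVN crossed_VVN ,_,")

print(with_verb.group(1))     # Arms_NNS folded_VVN ,_,
print(without_verb.group(1))  # Arms_NNS ,_,
print(two_verbs)              # None
```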
Your more general questions:
I find it hard to explain how these things work. I'll try to explain the constituent parts:
- \w matches word characters - characters you'd normally expect to find in words
- \w+ matches one or more of them (\w* would match zero or more)
- (NN|NNS) means "match NN or NNS"
- ?: means "match but don't capture" - suggest googling what capturing means in relation to regexes.
- ? alone means "match 0 or 1 of the thing before me", so x? would match "" or "x" but not "xx".
- None of the characters in ,_, are special, so we can match them just by putting them in the regex.
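Each of the pieces above can be tried on its own. This is a rough sketch assuming Python's re module; the variable names are just for illustration.

```python
import re

# \w+ needs at least one word character (letters, digits, underscore).
word_ok = re.fullmatch(r"\w+", "Arms")
word_empty = re.fullmatch(r"\w+", "")

# (NN|NNS) captures the tag it matched; (?:NN|NNS) groups without capturing.
capturing = re.fullmatch(r"\w+_(NN|NNS)", "Arms_NNS")
non_capturing = re.fullmatch(r"\w+_(?:NN|NNS)", "Arms_NNS")

# x? matches zero or one "x", but never two.
zero_x = re.fullmatch(r"x?", "")
one_x = re.fullmatch(r"x?", "x")
two_x = re.fullmatch(r"x?", "xx")

# The characters in ,_, are not special, so they match themselves.
literal = re.fullmatch(r",_,", ",_,")

print(capturing.groups())      # ('NNS',)
print(non_capturing.groups())  # ()
print(two_x)                   # None
```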
One problem with your regex is that \w will not match a comma (only "word characters").
[\w-] is valid: it is a character class that matches either a word character or a literal hyphen, so [\w-]+ also covers hyphenated words.
My solution assumes there is exactly one space, and nothing else, between your tagged words.
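If that assumption does not hold for your corpus, one possible variant (not part of my solution above, just a sketch assuming Python's re module) replaces each literal space with \s+ so that any run of whitespace is accepted. The double-spaced test string is made up.

```python
import re

# Strict version: exactly one space between tagged words.
STRICT = re.compile(r"\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,")
# Hypothetical lenient version: \s+ accepts one or more whitespace characters.
LENIENT = re.compile(r"\w+_(?:NN|NNS)\s+\w+_(?:VBG|VVN)\s+,_,")

# Two spaces after the noun and a tab before the comma.
double_spaced = "Arms_NNS  folded_VVN\t,_,"

strict_result = STRICT.search(double_spaced)
lenient_result = LENIENT.search(double_spaced)

print(strict_result)                   # None
print(lenient_result is not None)      # True
```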
Source: https://stackoverflow.com/questions/29829132/creating-more-complex-regexes-from-tag-format