Question
I'm trying to capture a section of Hebrew text (taken from comments on a news site) using the following regex:
[\u0590-\u05FF \\p{Graph} \\s]+
It works for most comments but some comments are missed.
I've tried to debug this and it seems there's a Hebrew letter that doesn't match the pattern.
When I extract this letter and print its integer value, it looks correct, but the regex still doesn't match it...
Ideas?
Answer 1:
It would be more semantically correct to use \p{InHebrew} instead of \u0590-\u05FF.
You also need to match punctuation, digits (at least the commonly used ones), and the different kinds of whitespace.
I don't know what \p{Graph} covers, or whether there are any Hebrew-specific punctuation symbols, but it seems you have missed some of these.
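To make the suggestion concrete, here is a minimal Java sketch, not a confirmed fix for the original problem. The class and method names (HebrewRegexSketch, isHebrewText, firstRejectedCodePoint) are hypothetical. It assumes the comments contain only Hebrew-block characters, ASCII punctuation, ASCII digits and whitespace. For reference, \p{Graph} in java.util.regex is ASCII-only by default, and Hebrew punctuation such as maqaf (U+05BE) and geresh (U+05F3) already sits inside the Hebrew block. The helper reports the first code point the class rejects, which may help track down the letter that is being missed.

```java
import java.util.regex.Pattern;

class HebrewRegexSketch {
    // Hebrew block (U+0590 to U+05FF) plus ASCII punctuation, ASCII digits
    // and whitespace. Which extra classes are needed is an assumption.
    static final Pattern HEBREW_TEXT =
            Pattern.compile("[\\p{InHebrew}\\p{Punct}\\p{Digit}\\s]+");

    // True only if the whole string consists of accepted characters.
    static boolean isHebrewText(String s) {
        return HEBREW_TEXT.matcher(s).matches();
    }

    // Returns the first code point the class rejects, or -1 if none is rejected.
    static int firstRejectedCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (!HEBREW_TEXT.matcher(ch).matches()) {
                return cp;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hebrew words written as escapes so the source encoding does not matter.
        String comment = "\u05E9\u05DC\u05D5\u05DD, \u05E2\u05D5\u05DC\u05DD 123!";
        System.out.println(isHebrewText(comment));

        // Same word followed by U+FB2A, a Hebrew presentation form that
        // lies outside the Hebrew block.
        String tricky = "\u05E9\u05DC\u05D5\u05DD\uFB2A";
        System.out.println(Integer.toHexString(firstRejectedCodePoint(tricky)));
    }
}
```

One possible cause of a "Hebrew letter that doesn't match", which the thread does not confirm, is a presentation form in the range U+FB1D to U+FB4F: these look like ordinary Hebrew letters but are outside U+0590 to U+05FF. If that turns out to be the case, the script property \p{IsHebrew} (available since Java 7) may be a better fit than the block property.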
Source: https://stackoverflow.com/questions/8987119/how-to-capture-hebrew-with-regex-in-java