Unicode regexp to match line-breaks?

Question

I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:

~^[\p{L}\p{M}\p{N} ]+$~u

This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried the "s" modifier, but that didn’t work either.

Any help is much appreciated. Thanks!

Answer 1:

A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.

But it looks like you’re trying to match generic whitespace there. In Java, that would be

 [\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]

which can be shortened by using ranges to “only” this:

 [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
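
As a minimal sketch of how such a class might be used from Java: the snippet below compiles a ranged whitespace class like the one above (starting at U+0009 so that tab is covered) and checks whether a string consists entirely of whitespace. The class name `UnicodeWhitespace` and the helper `isAllWhitespace` are made up for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeWhitespace {
    // Ranged whitespace class. Each backslash is doubled so that the regex
    // engine, not the Java compiler, gets to interpret the code point escapes.
    static final String WS =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

    static final Pattern ALL_WS = Pattern.compile(WS + "+");

    // Hypothetical helper: true if s is non-empty and every character
    // belongs to the whitespace class above.
    static boolean isAllWhitespace(String s) {
        return ALL_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Space, tab, CR, LF, and a no-break space.
        System.out.println(isAllWhitespace(" \t\r\n\u00A0"));
        // Contains letters, so not all whitespace.
        System.out.println(isAllWhitespace("a b"));
    }
}
```

Because `matches()` has to consume the whole input, no explicit anchors are needed around the class.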

It also looks like you’re trying to match alphanumerics.

Alphabetics alone are usually [\pL\pM\p{Nl}].
Numerics are less often all of \pN than they are just \p{Nd}, or else sometimes [\p{Nd}\p{Nl}].
Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java’s does). That’s what \w works out to in Unicode-aware regex languages (of which Java is not one).
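
The three classes can be tried out in Java roughly as follows. This is only a sketch: the class name `UnicodeClasses` and the three helper methods are invented for the example, and the sample inputs are arbitrary.

```java
import java.util.regex.Pattern;

public class UnicodeClasses {
    // Alphabetics: letters, combining marks, and letter numbers.
    static final Pattern ALPHABETIC = Pattern.compile("[\\pL\\pM\\p{Nl}]+");

    // The narrow reading of "numeric": decimal digits only.
    static final Pattern DECIMAL = Pattern.compile("\\p{Nd}+");

    // Identifier characters: a nested class holding an && intersection is
    // unioned with the properties listed before it.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}"
        + "[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    static boolean isAlphabetic(String s) {
        return ALPHABETIC.matcher(s).matches();
    }

    static boolean isDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphabetic("cafe\u0301"));  // e plus combining acute
        System.out.println(isDecimal("\u0663\u0664"));   // Arabic-Indic digits
        System.out.println(isIdentifier("snake_case1")); // underscore is Pc
        System.out.println(isIdentifier("a-b"));         // hyphen is Pd, not Pc
    }
}
```

The intersection keeps only the "other symbol" members of the Enclosed Alphanumerics block, so a circled letter such as U+24B6 passes as an identifier character while a circled digit such as U+2460 (category No) does not.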

In older versions of Perl, you would likely write a linebreak as

 (?:\r\n|\p{VertSpace})

although that’s now better written as

 (?:(?>\r\n)|\v)

which is exactly what

\R

matches.

Java is very clumsy at these things. There (at least before Java 8, which added its own \R) you must write a linebreak as

  (?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])

which of course requires extra bbaacckkssllaasshheess when written as a string.
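
To make the doubled backslashes concrete, here is a rough sketch that writes the explicit linebreak alternation as a Java string and combines it with the character class from the question, so that letters, marks, numbers, spaces, and linebreaks are all accepted. The names `LineBreakDemo`, `isValid`, and `isValidR` are hypothetical. The second pattern assumes Java 8 or later, where java.util.regex accepts \R directly.

```java
import java.util.regex.Pattern;

public class LineBreakDemo {
    // The explicit linebreak alternation as a Java string literal:
    // every regex backslash is doubled.
    static final String LINEBREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    // The question's class (letters, marks, numbers, plain space),
    // extended with the linebreak alternation.
    static final Pattern TEXT =
        Pattern.compile("(?:[\\p{L}\\p{M}\\p{N} ]|" + LINEBREAK + ")+");

    // Same idea with the built-in linebreak escape (assumes Java 8+).
    static final Pattern TEXT_R =
        Pattern.compile("(?:[\\p{L}\\p{M}\\p{N} ]|\\R)+");

    // matches() must consume the whole input, so no anchors are needed.
    static boolean isValid(String s) {
        return TEXT.matcher(s).matches();
    }

    static boolean isValidR(String s) {
        return TEXT_R.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("first line\r\nsecond line"));
        System.out.println(isValidR("first line\nsecond line"));
        System.out.println(isValid("tab\there")); // tab is not in the class
    }
}
```

Note that the linebreak goes into an alternation next to the character class rather than inside it. The same shape should carry over to the PCRE pattern in the question, since PCRE also accepts \R outside a character class.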

I give the other Java equivalences for the 14 common character-class regex escapes, rewritten so that they work with Unicode, in a separate answer. You may have to use those in other Java-like regex languages that aren’t sufficiently Unicode-aware.

Source: https://stackoverflow.com/questions/4388630/unicode-regexp-to-match-line-breaks

Tags

regex

unicode

character-properties

line-breaks