Java regular expression to match _all_ whitespace characters

别那么骄傲 2020-12-08 02:36

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and s

7 answers
  • 2020-12-08 02:54

    &nbsp; is not white space. It is a character encoding sequence that represents whitespace in HTML. You most likely want to convert HTML-encoded text into plain text before running your string match against it. If that is the case, go look up javax.swing.text.html.

  • 2020-12-08 02:54

    The regex characters are the only ones independent of encoding. Here is a list of some characters which, in Unicode, are non-printing:

    How many non-printing characters are in common use?

  • 2020-12-08 02:57

    &nbsp; is not a whitespace character as far as regexes are concerned. You need to either modify the regex to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or parse the string contents beforehand to get the ASCII or Unicode representation of the data.

    You are mixing abstraction levels here.
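For the first option, a minimal Java sketch might look like the following. It assumes the input still contains the literal strings "&nbsp;" and "%20" (that is, it has not been HTML- or URL-decoded yet). The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: treats the literal strings "&nbsp;" and "%20"
// as whitespace alongside whatever \s already matches.
final class EntityWhitespace {
    // Alternation of real whitespace and the two encoded forms.
    private static final Pattern WS_OR_ENTITY = Pattern.compile("\\s|&nbsp;|%20");

    // Replaces every match with a single ASCII space.
    static String normalize(String input) {
        return WS_OR_ENTITY.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("a&nbsp;b%20c\td"));
    }
}
```

This only handles the two encodings listed in the pattern. Decoding the text properly first is the more general route.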

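    If, as seems to be the case after a careful reread of the question, you want to match all whitespace characters, meaning standard ASCII whitespace plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the job. Note that \p{Z} covers only the Unicode separator categories, so combine it with \s (for example [\s\p{Z}]) if tabs and line breaks should match as well.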
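As a rough Java sketch of the \p{Z} approach: the class below combines \p{Z} with \s in one character class, since tab and newline are control characters rather than Unicode separators. The class and method names are placeholders, not anything from the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing \p{Z} next to \s.
final class UnicodeSeparators {
    // \p{Z} matches the Unicode separator categories (Zs, Zl, Zp);
    // \s adds tab, newline and the other ASCII whitespace controls.
    private static final Pattern ANY_WS = Pattern.compile("[\\s\\p{Z}]+");

    // Collapses each run of whitespace into one ASCII space.
    static String collapse(String input) {
        return ANY_WS.matcher(input).replaceAll(" ");
    }

    // True if the string consists only of Unicode separator characters.
    static boolean isSeparatorOnly(String s) {
        return s.matches("\\p{Z}+");
    }
}
```

For example, `collapse("a\u00A0\u2003b\t\nc")` gives `"a b c"`, while `isSeparatorOnly("\t")` is false because a tab is not in any Z category.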

    You should really clarify your question, because it has misled a lot of people (the correct answer even picked up some downvotes).

  • 2020-12-08 02:58

    The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.

  • 2020-12-08 02:58

    You clarified the question the way I expected: you're actually not looking for the String literal &nbsp;, as many here seem to think, and for which the solution is too obvious.

    Well, unfortunately, there's no way to match them all using \s alone. The best option is to include the particular code points in the pattern, for example: "[\\s\\xA0]".
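A small sketch of that pattern in use, splitting a string on ASCII whitespace or U+00A0. The class name and the `tokens` method are invented for the example. Extend the character class with any other code points you need.

```java
import java.util.regex.Pattern;

// Hypothetical tokenizer built on the "[\\s\\xA0]" pattern from above.
final class ExplicitCodePoints {
    // \s covers ASCII whitespace; \xA0 adds the no-break space explicitly.
    private static final Pattern WS = Pattern.compile("[\\s\\xA0]+");

    // Splits the input on runs of the listed whitespace characters.
    static String[] tokens(String input) {
        return WS.split(input);
    }
}
```

With this, `tokens("foo\u00A0bar baz")` yields the three tokens `foo`, `bar` and `baz`.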

    Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, could you please leave a comment on how you found that out? This one is quite useful.

  • 2020-12-08 03:05

    Here's a summary I made of several competing definitions of "whitespace":

    http://spreadsheets.google.com/pub?key=pd8dAQyHbdewRsnE5x5GzKQ

    You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.
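To see how a few of those definitions disagree in Java itself, here is a sketch that checks some sample code points against default \s, \s with `Pattern.UNICODE_CHARACTER_CLASS` (Java 7 and later), \p{Z}, `Character.isWhitespace` and `Character.isSpaceChar`. The class name and the sample code points are my own picks, not taken from the spreadsheet.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of several "whitespace" definitions available in Java.
final class WhitespaceDefinitions {
    static final Pattern ASCII_S = Pattern.compile("\\s");
    // With this flag, \s follows the Unicode White_Space property.
    static final Pattern UNICODE_S = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern SEPARATOR = Pattern.compile("\\p{Z}");

    // True if the single code point matches the whole pattern.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // One report line per code point, one column per definition.
    static String describe(int cp) {
        return String.format(
            "U+%04X  \\s=%b  (?U)\\s=%b  \\p{Z}=%b  isWhitespace=%b  isSpaceChar=%b",
            cp, matches(ASCII_S, cp), matches(UNICODE_S, cp), matches(SEPARATOR, cp),
            Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        // Tab, space, no-break space, em space.
        int[] samples = {0x09, 0x20, 0xA0, 0x2003};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

The no-break space (U+00A0) is a good test case: default \s and `Character.isWhitespace` reject it, while \p{Z}, `Character.isSpaceChar` and the Unicode-aware \s accept it.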
