问题
I want to be replace any occurrence of more than one space with a single space, but take no action in text between quotes.
Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?
回答1:
Here's another approach, that uses a lookahead to determine that all quotation marks after the current position come in matched pairs.
text = text.replaceAll(" ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");
If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.
回答2:
When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:
("[^"\\]*(?:\\.[^"\\]*)*")|( +)
This will match a quoted string or two or more spaces. Because the two expressions are combined, it will match a quoted string OR two or more spaces, but not spaces within quotes. Using this expression, you will need to examine each match to determine if it is a quoted string or two or more spaces and act accordingly:
Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|( +)" );
StringBuffer replacementBuffer = new StringBuffer();
Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );
while ( spaceOrStringMatcher.find() )
{
// if the space group is the match
if ( spaceOrStringMatcher.group( 2 ) != null )
{
// replace with a single space
spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
}
}
spaceOrStringMatcher.appendTail( replacementBuffer );
回答3:
text between quotes : Are the quotes within the same line or multiple lines ?
回答4:
Tokenize it and emit a single space between tokens. A quick google for "java tokenizer that handles quotes" turned up: this link
YMMV
edit: SO didn't like that link. Here's the google search link: google. It was the first result.
回答5:
Personally, I don't use Java, but this RegExp could do the trick:
([^\" ])*(\\\".*?\\\")*
Trying the expression with RegExBuddy, it generates this code, looks fine to me:
try {
Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
for (int i = 1; i <= regexMatcher.groupCount(); i++) {
// matched text: regexMatcher.group(i)
// match start: regexMatcher.start(i)
// match end: regexMatcher.end(i)
// I suppose here you must use something like
// sstr += regexMatcher.group(i) + " "
}
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
At least, it seems to work fine in Python:
import re
text = """
este es un texto de prueba "para ver como se comporta " la funcion sobre esto
"para ver como se comporta " la funcion sobre esto "o sobre otro" lo q sea
"""
ret = ""
print text
reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)
for match in reobj.finditer(text):
if match.group() <> "":
ret = ret + match.group() + "|"
print ret
回答6:
After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:
String text = "ABC DEF GHI JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
回答7:
Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2)
returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:
import java.util.regex.*;
public class Test
{
public static void main(String[] args) throws Exception
{
String text = "blah blah \"boo boo boo\" blah blah";
Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|( +)" );
StringBuffer sb = new StringBuffer();
Matcher m = p.matcher( text );
while ( m.find() )
{
if ( m.group( 2 ) != null )
{
m.appendReplacement( sb, " " );
}
}
m.appendTail( sb );
System.out.println( sb.toString() );
}
}
来源:https://stackoverflow.com/questions/263985/regex-question-one-or-more-spaces-outside-of-a-quote-enclosed-block-of-text