Currently I am working on an application that splits a long column into short ones. For that I split the entire text into words, but at the moment my regex splits numbers too.
Use a lookahead so that a dot is only replaced when it is followed by whitespace and a word character:
sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
OUTPUT:
["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]
You forgot to put '\s' in your regexp. Try this one:
var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);
http://jsfiddle.net/hrRrW/
str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
Output:
[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
'This is another sentence.' ]
Breakdown:
([.?!]) = Capture either . or ? or !
\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that follow a punctuation mark in English text. The whitespace is matched but not captured, so it is dropped by the replacement.
(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English sentences start with a capital letter. None of the previous regexes take this into account.
The replace operation uses "$1|". We used one capturing group, ([.?!]), which captures one of those characters. The replacement is $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|.
Finally, we split on the pipes | and get our result.
So, essentially, what we are saying is this:
1) Find punctuation marks (one of . or ? or !) and capture them
2) Punctuation marks can optionally be followed by whitespace.
3) After a punctuation mark, I expect a capital letter.
Unlike the previous regular expressions provided, this would properly match the English language grammar.
From there:
4) We replace the captured punctuation marks by appending a pipe |
5) We split the pipes to create an array of sentences.
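The five steps above can be traced one at a time. What follows is a minimal sketch rather than part of the original answer: the function name splitIntoSentences is hypothetical, and it assumes the input never contains a literal |, because the pipe is used as the temporary separator.

```javascript
// Hypothetical helper that spells out steps 1-5 from the answer above.
// Assumption: the input text never contains a literal "|".
function splitIntoSentences(text) {
  // Steps 1-3: a punctuation mark (captured), optional whitespace, and then
  // a capital letter. The lookahead checks the letter without consuming it.
  const boundary = /([.?!])\s*(?=[A-Z])/g;

  // Step 4: keep the punctuation mark ($1), drop the whitespace, append a pipe.
  const piped = text.replace(boundary, "$1|");

  // Step 5: split on the pipes to get an array of sentences.
  return piped.split("|");
}

const sample =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitIntoSentences(sample));
```

One limitation worth keeping in mind: an abbreviation followed by a capitalised word, such as "Mr. Smith", also satisfies the pattern and will be split after "Mr.".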
You're safer using lookahead to make sure what follows after the dot is not a digit.
var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);
If you want to be even safer you could also check whether what comes before the dot is a digit. Lookbehind was only added to JavaScript in ES2018, so for engines without it you need to capture the previous character and use it in the replace string.
var str ="This is another sentence.1 is a good number"
var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);
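In engines that do support lookbehind assertions (ES2018 or later), the capture-and-reinsert step can probably be avoided. The sketch below is an assumption-laden variant of the regex above, not the answerer's code; markSentenceDots is a hypothetical name.

```javascript
// Sketch assuming an engine with lookbehind support (ES2018 or later).
// A dot is marked as a boundary unless it has a digit on both sides.
function markSentenceDots(text) {
  // \.(?!\d)        : a dot that is not followed by a digit
  // (?<!\d)\.(?=\d) : a dot followed by a digit but not preceded by one
  return text.replace(/\.(?!\d)|(?<!\d)\.(?=\d)/g, ".|");
}

console.log(markSentenceDots("This is another sentence.1 is a good number"));
// "This is another sentence.|1 is a good number"
console.log(markSentenceDots("Numbers like 125.000 stay intact. Next one."));
// "Numbers like 125.000 stay intact.| Next one.|"
```

Because nothing is captured, the replacement string no longer needs $1.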
An even simpler solution is to escape the dots inside numbers (replace them with $$$$ for example), do the split and afterwards unescape the dots.
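A rough sketch of that escape, split, unescape idea follows. It is only an illustration under assumptions: the names DOT_PLACEHOLDER and splitProtectingNumbers are hypothetical, the placeholder character is assumed never to occur in the input, and the input is assumed to contain no literal |. A different placeholder than $$$$ is used because in a JavaScript replacement string $$ stands for a single $.

```javascript
// Hypothetical placeholder; assumed never to occur in the input text.
const DOT_PLACEHOLDER = "\u0000";

function splitProtectingNumbers(text) {
  // 1. Escape: hide every dot that sits between two digits.
  const escaped = text.replace(/(\d)\.(?=\d)/g, (_, digit) => digit + DOT_PLACEHOLDER);

  // 2. Split: mark each remaining ., ? or ! (plus trailing whitespace)
  //    with a pipe, split on the pipes, and drop empty pieces.
  const parts = escaped
    .replace(/([.?!])\s*/g, "$1|")
    .split("|")
    .filter(part => part.length > 0);

  // 3. Unescape: put the hidden dots back.
  return parts.map(part => part.split(DOT_PLACEHOLDER).join("."));
}

const numbersSample =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitProtectingNumbers(numbersSample));
```

With the dots inside numbers hidden, the splitting regex itself can stay very simple, which is the appeal of this approach.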
str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")
@Roger Poon's and @Antonín Slejška's answers work well.
It would be better to also trim each sentence and filter out empty strings:
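const splitBySentence = (str) => {
  // Mark each sentence boundary with a pipe, then split on the pipes.
  return str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    .split("|")
    // Trim first so that whitespace-only pieces are filtered out as well.
    .map(sentence => sentence.trim())
    .filter(sentence => sentence.length > 0);
};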
const content = `
The Times has identified the following reporting anomalies or methodology changes in the data for New York:
May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.
June 30: New York City released deaths from earlier periods but did not specify when they were from.
Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.
Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
`;
console.log(splitBySentence(content));