Split string into sentences in JavaScript

悲哀的现实 2020-11-29 06:08

Currently I am working on an application that splits a long column into short ones. For that I split the entire text into words, but at the moment my regex splits numbers too.

8 Answers
  • 2020-11-29 06:23

    Use a lookahead so that a dot is replaced only when it is followed by whitespace and a word character:

    sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
    

    OUTPUT:

    ["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]
    
  • 2020-11-29 06:26

    You forgot to put '\s' in your regexp.

    Try this one:

    var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
    var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
    console.log(sentences[0]);
    console.log(sentences[1]);
    

    http://jsfiddle.net/hrRrW/

  • 2020-11-29 06:27
    str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
    

    Output:

    [ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
      'This is another sentence.' ]
    

    Breakdown:

    ([.?!]) = Capture either . or ? or !

    \s* = Match 0 or more whitespace characters following the previous token ([.?!]). These are matched but not captured, so they are dropped by the replacement. This accounts for the spaces that normally follow a punctuation mark in English.

    (?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English language sentences start with a capital letter. None of the previous regexes take this into account.


    The replace operation uses:

    "$1|"
    

    We used one "capturing group" ([.?!]), which captures one of those characters. The whole match (the punctuation mark plus any whitespace after it) is replaced with $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|.

    Finally, we split on the pipes | and get our result.


    So, essentially, what we are saying is this:

    1) Find punctuation marks (one of . or ? or !) and capture them

    2) A punctuation mark can optionally be followed by whitespace.

    3) After a punctuation mark, I expect a capital letter.

    Unlike the regular expressions provided earlier, this follows the usual conventions of written English, although abbreviations followed by a capitalized word (such as "Mr. Smith") would still be split.

    From there:

    4) We replace the captured punctuation marks by appending a pipe |

    5) We split on the pipes to create an array of sentences.
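
    The five steps can be traced with the intermediate string kept in its own variable. This is only an illustrative sketch; the variable names marked and sentences are made up for the example.

```javascript
// Illustrative walk-through of the steps above; variable names are hypothetical.
const str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";

// Steps 1-4: find . ? or ! followed by optional whitespace and a capital
// letter, then replace the match with the punctuation mark plus a pipe.
const marked = str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|");

// Step 5: split on the pipes to get the array of sentences.
const sentences = marked.split("|");

console.log(marked);
console.log(sentences);
```

    Note that the pipe is only a temporary marker, so this approach assumes the input text does not already contain a | character.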

  • 2020-11-29 06:39

    You're safer using lookahead to make sure what follows after the dot is not a digit.

    var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
    
    var sentences = str.replace(/\.(?!\d)/g,'.|');
    console.log(sentences);
    

    If you want to be even safer you could also check whether the character before the dot is a digit. Since JS had no lookbehind before ES2018, the workaround is to capture the previous character and reuse it in the replacement string.

    var str ="This is another sentence.1 is a good number"
    
    var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
    console.log(sentences);
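
    On an engine with ES2018 lookbehind, roughly the same check can be written without the capture group. A minimal sketch, assuming such a runtime; the function name markSentenceEnds is hypothetical:

```javascript
// Hypothetical helper: inserts "|" after every dot that is NOT sitting
// between two digits, so "125.000" is left alone.
// Assumes lookbehind support (ES2018).
function markSentenceEnds(text) {
  // (?<!\d)\.  -> a dot that is not preceded by a digit
  // \.(?!\d)   -> a dot that is not followed by a digit
  return text.replace(/(?<!\d)\.|\.(?!\d)/g, ".|");
}

console.log(markSentenceEnds("This is another sentence.1 is a good number"));
console.log(markSentenceEnds("Numbers like 125.000 stay intact. Done."));
```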
    

    An even simpler solution is to escape the dots inside numbers (replace them with $$$$ for example), do the split and afterwards unescape the dots.
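
    A rough sketch of that escape, split, unescape idea follows. The helper name splitKeepingNumbers and the placeholder character are assumptions for illustration. A NUL character is used instead of $$$$ because in a JS replacement string $$ stands for a single literal $, which makes dollar-sign placeholders awkward.

```javascript
// Hypothetical helper illustrating escape -> split -> unescape.
// PLACEHOLDER is an assumption: any character that cannot occur in the input.
const PLACEHOLDER = "\u0000";

function splitKeepingNumbers(text) {
  // 1. Escape: hide every dot that sits between two digits.
  const escaped = text.replace(/(\d)\.(?=\d)/g, "$1" + PLACEHOLDER);

  return escaped
    // 2. Split: mark the sentence ends with a pipe, then split on it.
    .replace(/([.?!])\s*/g, "$1|")
    .split("|")
    // Drop the empty string left after the final punctuation mark.
    .filter((part) => part.length > 0)
    // 3. Unescape: put the hidden dots back.
    .map((part) => part.split(PLACEHOLDER).join("."));
}

console.log(splitKeepingNumbers(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
));
```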

  • 2020-11-29 06:40
    str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")
    

    The RegExp:

    • (\.+|\:|\!|\?) = A sentence can end not only with ".", "!" or "?", but also with "..." or ":"
    • (\"*|\'*|\)*|}*|]*) = The sentence can be enclosed in quotation marks, parentheses or brackets, which may follow the end punctuation
    • (\s|\n|\r|\r\n) = A sentence has to be followed by a space or an end of line
    • g = global
    • m = multiline
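
    To see the three groups at work, here is the same pattern applied to a made-up sample string (the sample text and variable names are not from the original answer):

```javascript
// The pattern from the answer above, applied to a hypothetical sample.
const pattern = /(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm;
const sample = 'He said "Stop!" Then he left... Nobody knew why.';

// $1 keeps the end punctuation, $2 keeps a closing quote or bracket,
// and the whitespace matched by the third group is replaced by the pipe.
const parts = sample.replace(pattern, "$1$2|").split("|");

console.log(parts);
```

    The final "why." is not followed by whitespace, so the pattern does not match there and no empty trailing element is produced.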

    Remarks:

    • If you use (?=[A-Z]), the RegExp will not work correctly in some languages. E.g. "Ü", "Č" or "Á" will not be recognised as capital letters.
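
    One possible way around this, on engines that support Unicode property escapes (ES2018), is to look ahead for \p{Lu} with the u flag instead of [A-Z]. The function names below are hypothetical and this is only a sketch of the idea:

```javascript
// ASCII-only lookahead, as discussed in the remark above.
function splitOnAsciiCapital(text) {
  return text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");
}

// Hypothetical variant: \p{Lu} matches any uppercase letter known to Unicode.
// The "u" flag is required for property escapes.
function splitOnAnyCapital(text) {
  return text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");
}

const sampleText = "Das ist gut. Über uns. Čau!";
console.log(splitOnAsciiCapital(sampleText)); // not split: no A-Z after the dots
console.log(splitOnAnyCapital(sampleText));   // split into three sentences
```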
  • 2020-11-29 06:42

    @Roger Poon's and @Antonín Slejška's answers work well.

    It would be better to also trim each sentence and filter out empty strings:

    const splitBySentence = (str) => {
      return str.replace(/([.?!])(\s)*(?=[A-Z])/g, "$1|")
        .split("|")
        // Trim first so that whitespace-only pieces are filtered out too.
        .map(sentence => sentence.trim())
        .filter(sentence => !!sentence);
    }
    
    const content = `
    The Times has identified the following reporting anomalies or methodology changes in the data for New York:
    
    May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.
    
    June 30: New York City released deaths from earlier periods but did not specify when they were from.
    
    Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.
    
    Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
    `;
    
    console.log(splitBySentence(content));
