Split string into sentences using regex

挽巷 2020-11-28 10:54

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = \         


        
6 Answers
  • 2020-11-28 11:35

    As should be expected, natural language processing of any sort is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not. Every rule has 20-40% exceptions. That said, a single regex that could do your bidding would be off the charts in complexity. Still, the following solution relies mainly on regexes.


    • The idea is to gradually go over the text.
    • At any given time, the current chunk of the text will be contained in two different parts. One, which is the candidate for a substring before a sentence boundary and another - after.
    • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
    • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.
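    To make the pairing idea concrete, here is a minimal sketch of how one position could be classified. The helper name classify_position and the two toy rules are illustrative assumptions, not the rules from the translated library; the full rule set appears in the code further down.

```php
<?php
// Hypothetical helper: each rule is [before-regex, after-regex, is-boundary].
// The first pair whose two regexes both match decides the outcome.
function classify_position(string $before, string $after, array $rules): ?bool
{
    foreach ($rules as [$beforeRe, $afterRe, $isBoundary]) {
        if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
            return $isBoundary;
        }
    }
    return null; // no pair matched: not a candidate position at all
}

// Toy rules for illustration only. Exceptions come first, boundaries last.
$rules = [
    ['/\bMr\.\s\z/u', '/\A/u',        false], // "Mr. " looks like a boundary but is not
    ['/[.!?]\s\z/u',  '/\A\p{Lu}/u',  true],  // punctuation, space, then an uppercase letter
];

var_dump(classify_position('Hello Mr. ', 'Smith', $rules)); // bool(false)
var_dump(classify_position('It rains. ', 'Then',  $rules)); // bool(true)
var_dump(classify_position('It rai',     'ns.',   $rules)); // NULL
```

    Ordering the exception rules before the boundary rules is what lets a false positive such as "Mr. " be rejected before the generic boundary rule gets a chance to fire.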

    As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

    As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

    In terms of performance - the regexes should be highly performant, as all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking can occur. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.


    Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


    function sentence_split($text) {
        $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
            '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
            '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
            '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
            '/(?:(?:\b[Ee]tc\.\s))\Z/su',
            '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
            '/(?:(?:\b\p{L}\.))\Z/su',
            '/(?:(?:\b\p{L}\.\s))\Z/su',
            '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
            '/(?:(?:[\"”\']\s*))\Z/su',
            '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
            '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
            '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
        $after_regexes = array('/\A(?:)/su',
            '/\A(?:[\p{N}\p{Ll}])/su',
            '/\A(?:[^\p{Lu}])/su',
            '/\A(?:[^\p{Lu}]|I)/su',
            '/\A(?:[^\p{Lu}])/su',
            '/\A(?:\p{Ll})/su',
            '/\A(?:\p{L}\.)/su',
            '/\A(?:\p{L}\.\s)/su',
            '/\A(?:\p{N})/su',
            '/\A(?:\s*\p{Ll})/su',
            '/\A(?:)/su',
            '/\A(?:\p{Lu}[^\p{Lu}])/su',
            '/\A(?:\p{Lu}\p{Ll})/su');
        $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
        $count = 13;
    
        $sentences = array();
        $sentence = '';
        $before = '';
        $after = substr($text, 0, 10);
        $text = substr($text, 10);
    
        while($text != '') {
            for($i = 0; $i < $count; $i++) {
                if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                    if($is_sentence_boundary[$i]) {
                        array_push($sentences, $sentence);
                        $sentence = '';
                    }
                    break;
                }
            }
    
            $first_from_text = $text[0];
            $text = substr($text, 1);
            $first_from_after = $after[0];
            $after = substr($after, 1);
            $before .= $first_from_after;
            $sentence .= $first_from_after;
            $after .= $first_from_text;
        }
    
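        // Flush the tail: $after still holds the last (up to 10) bytes, and
        // $sentence is empty when the input is shorter than that window.
        if($sentence . $after != '') {
            array_push($sentences, $sentence.$after);
        }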
    
        return $sentences;
    }
    
    $text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
    print_r(sentence_split($text));
    
  • 2020-11-28 11:35

    Henrik Petterson, please read this completely, because I need to repeat a few things that were already said above.

    As many people above have mentioned, adding the /u modifier makes the regex work on Unicode characters. That is true, and it works perfectly in the example below:

    http://ideone.com/750lMn

    <?php
    
    
        function splitSentences($text) {
            $re = '/# Split sentences on whitespace between them.
                (?<=                # Begin positive lookbehind.
                  [.!?]             # Either an end of sentence punct,
                | [.!?][\'"]        # or end of sentence punct and quote.
                )                   # End positive lookbehind.
                (?<!                # Begin negative lookbehind.
                  Mr\.              # Skip either "Mr."
                | Mrs\.             # or "Mrs.",
                | Ms\.              # or "Ms.",
                | Jr\.              # or "Jr.",
                | Dr\.              # or "Dr.",
                | Prof\.            # or "Prof.",
                | Vol\.             # or "Vol.",
                | A\.D\.            # or "A.D.",
                | B\.C\.            # or "B.C.",
                | Sr\.              # or "Sr.",
                | T\.V\.A\.         # or "T.V.A.",
                                    # or... (you get the idea).
                )                   # End negative lookbehind.
                \s+                 # Split on whitespace between sentences.
                /uix';
    
            $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
            return $sentences;
        }
    
    $sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. &Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';
    
    $sentences = splitSentences($sentences);
    
    print_r($sentences);
    

    The examples you gave in the comments did not work because they have no whitespace characters between the two sentences, and your regex explicitly requires whitespace between sentences:

    \s+                 # Split on whitespace between sentences.
    

    The example below, taken from your comments above, does not work simply because there is no space before Â.

    http://ideone.com/m164fp
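    A stripped-down sketch of the same effect, using a simplified pattern (an assumption for brevity, not the full regex above): the split only happens where whitespace actually follows the punctuation.

```php
<?php
// Simplified pattern: split on whitespace that follows sentence punctuation.
$re = '/(?<=[.!?])\s+/u';

$spaced = preg_split($re, 'One. Two. Three.', -1, PREG_SPLIT_NO_EMPTY);
$glued  = preg_split($re, 'One.Two.Three.',   -1, PREG_SPLIT_NO_EMPTY);

print_r($spaced); // three elements: "One.", "Two.", "Three."
print_r($glued);  // one element: nothing to split on, since \s+ never matches
```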

  • 2020-11-28 11:37

    If spaces are unreliable, then you could match on a . followed by any number of spaces, followed by a capital letter.

    You can match any uppercase Unicode letter using the Unicode character property \p{Lu}.

    You only need to exclude abbreviations that tend to precede proper names (personal names, company names, etc.), since those names start with a capital letter.

    function splitSentences($text) {
        $re = '/                # Split sentences ending with a dot
            .+?                 # Match everything before, until we find
            (
              $ |               # the end of the string, or
              \.                # a dot
              (?<!              #  Begin negative lookbehind.
                Mr\.            #   Skip either "Mr."
              | Mrs\.           #   or "Mrs.",
                                #   or... (you get the idea).
              )                 #   End negative lookbehind.
              "?                #   Optionally match a quote
              \s*               #   Any number of whitespaces
              (?=               #  Begin positive lookahead
                \p{Lu} |        #   an upper case letter, or
                "               #   a quote
              )
            )
            /iux';
    
        if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) { 
            return [];
        }
    
        $sentences = array_map('trim', $matches[0]);
    
        return $sentences;
    }
    
    $text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
    $sentences = splitSentences($text);
    
    print_r($sentences);
    

    Note: This answer might not be accurate enough for your situation. I'm unable to judge that. It does address the problem as described above and is easily understandable.

  • 2020-11-28 11:41

    Â followed by a space is what you see when the UTF-8 encoding of U+00A0 Non-Breaking Space (the bytes C2 A0) is printed to a page/console that interprets it as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.

    \s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as two characters: Â and a non-breaking space.
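    A small sketch of the difference, assuming PHP 7 or later (for the "\u{...}" escape) and a simplified split pattern rather than the regex from the question:

```php
<?php
// "\u{00A0}" is U+00A0 NO-BREAK SPACE, encoded in UTF-8 as the bytes C2 A0.
$text = "First sentence.\u{00A0}Second sentence.";

// With /u the subject is treated as UTF-8, and \s matches U+00A0.
$withU = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($withU);

// Without /u the pattern sees raw bytes. The byte right after "." is C2,
// which is not whitespace, so no split point is found.
$withoutU = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
var_dump(count($withoutU)); // int(1)
```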

  • 2020-11-28 11:53

    There is a quite complex Unicode Text Segmentation algorithm that deals with various text boundaries, including sentence boundaries.

    http://unicode.org/reports/tr29/

    The best-known implementation of this algorithm is ICU's.

    I have found this class: http://php.net/manual/en/class.intlbreakiterator.php however, when I looked, it seemed to be available only in git, not in a mainstream release.

    So if you want to solve this VERY complex problem in the best way, I'd suggest that you:

    • Get this class from somewhere
    • Write a small PHP plugin that wraps ICU functionality you need - it is actually quite simple as long as you build specific functionality.
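    For readers whose PHP build includes the intl extension, IntlBreakIterator is available directly, so no custom wrapper is needed. The following is a minimal sketch under that assumption; the function name icu_split_sentences and the default locale are illustrative choices.

```php
<?php
// Sketch only: requires the intl extension (ICU).
// icu_split_sentences is a hypothetical helper name.
function icu_split_sentences(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part);
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

print_r(icu_split_sentences('Hello world. How are you?'));
```

    Note that the default ICU rules know nothing about abbreviations such as "Mr.", so the result should still be checked against representative text.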
  • 2020-11-28 11:54

    I believe that it is impossible to get a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, reaching 100% correct results is just impossible because scraping and content-extraction tools are imperfect and may fail to deliver clean content, leaving whitespace or punctuation rubbish in the text. And finally, business is now more biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, it is in most cases considered a success.

    Now, any sentence splitting task is an NLP task, and just one, or two, or three regexps are not enough. Rather than devising your own regex chain, I'd advise using an existing NLP library for that.

    1. vanderlee's php-sentence (depends on reasonably grammatically correct punctuation)

    The following is a rough list of the rules used to split sentences.

    • Each linebreak separates sentences.
    • The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
    • Sentences must be at least two words long, unless a linebreak or end-of-text.
    • An empty line is not a sentence.
    • Each question mark or exclamation mark, or combination thereof, is considered the end of a sentence.
    • A single period is considered the end of a sentence, unless...
      • It is preceded by one word, or...
      • It is followed by one word.
    • A sequence of multiple periods is not considered the end of a sentence.

    Usage example:

    <?php
        require_once 'classes/autoloader.php'; // Include the autoloader.
        $text   = "Hello there, Mr. Smith. What're you doing today... Smith,"
                . " my friend?\n\nI hope it's good. This last sentence will"
                . " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
        $Sentence   = new Sentence;   // Create a new instance
        $sentences  = $Sentence->split($text); // Split into array of sentences
        $count      = $Sentence->count($text); // Count the number of sentences
    ?>
    
    2. NlpTools is another library you might use for this task. Here is sample code implementing a naive rule-based sentence tokenizer:

    Sample code:

    <?php
    include ('vendor/autoload.php');
     
    use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
    use \NlpTools\Tokenizers\WhitespaceTokenizer;
    use \NlpTools\Classifiers\ClassifierInterface;
    use \NlpTools\Documents\DocumentInterface;
     
    class EndOfSentence implements ClassifierInterface
    {
        public function classify(array $classes, DocumentInterface $d) {
            list($token,$before,$after) = $d->getDocumentData();
     
            $dotcnt = count(explode('.',$token))-1;
            $lastdot = substr($token,-1)=='.';
     
            if (!$lastdot) // assume that all sentences end in full stops
                return 'O';
     
            if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
                return 'O';
     
            return 'EOW';
        }
    }
    $tok = new ClassifierBasedTokenizer(
        new EndOfSentence(),
        new WhitespaceTokenizer()
    );
    $text = "We are what we repeatedly do.
            Excellence, then, is not an act, but a habit.";
     
    print_r($tok->tokenize($text));
     
    // Array
    // (
    //    [0] => We are what we repeatedly do.
    //    [1] => Excellence, then, is not an act, but a habit.
    // )
     
    
    3. You can get a PHP/Java bridge for using the Java Stanford NLP library (here is a Java example of splitting text into sentences).

    IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after each punctuation chain, sentence splitting quality improves. Just add this before sending the text to the sentence splitting function:

    $txt = preg_replace('~\p{P}+~u', "$0 ", $txt); // /u: treat $txt as UTF-8 so multibyte characters are not split
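    One side effect worth knowing: the pattern above puts a space after every punctuation run, including the dot inside a number. The sketch below compares it with a narrower variant, which is my own assumption and not part of the advice above: it only touches sentence-ending punctuation that is glued directly to an uppercase letter.

```php
<?php
$txt = 'First one.Second costs 2.50!Third';

// Pattern from the note above (with /u): a space after every punctuation run.
$broad = preg_replace('~\p{P}+~u', '$0 ', $txt);

// Narrower variant (assumption): only [.!?…] runs followed by an uppercase letter.
$narrow = preg_replace('~[.!?…]+(?=\p{Lu})~u', '$0 ', $txt);

echo $broad, "\n";  // First one. Second costs 2. 50! Third
echo $narrow, "\n"; // First one. Second costs 2.50! Third
```

    The narrow variant misses glued sentences that start with a lowercase letter, a digit, or a quote, so which of the two fits better depends on the input.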
    