Python re.split() vs nltk word_tokenize and sent_tokenize

不知归路 2020-12-29 08:54

I was going through this question.

I am just wondering whether NLTK would be faster than regex for word/sentence tokenization.

1 answer
  • 2020-12-29 09:12

    The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.

    Do note that str.split() doesn't produce tokens in the linguistic sense, e.g.:

    >>> sent = "This is a foo, bar sentence."
    >>> sent.split()
    ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
    >>> from nltk import word_tokenize
    >>> word_tokenize(sent)
    ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
    

    str.split() is usually used to separate strings on a specified delimiter, e.g. in a tab-separated file you can use str.split('\t'), or you can split on the newline \n when your text file has one sentence per line.
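
As a minimal sketch of the delimiter use case, the snippet below splits a tab-separated row and a small one-sentence-per-line string. The sample strings `row` and `doc` are made up for illustration. The last step uses a simple regex (an assumption of this sketch, not what NLTK does internally) that happens to separate the punctuation of this particular sentence the same way the word_tokenize() output above does.

```python
import re

# Hypothetical sample data, not taken from the benchmark corpus.
row = "id42\tThis is a foo, bar sentence.\ten"
doc = "First sentence.\nSecond sentence."

fields = row.split('\t')   # split on an explicit delimiter
lines = doc.split('\n')    # one sentence per line

# A rough regex tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", fields[1])

print(fields)
print(lines)
print(tokens)
```

Such a regex is only a rough approximation: it would also break up contractions and abbreviations, which the Treebank tokenizer treats with dedicated rules.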

    Now let's do some benchmarking in Python 3:

    import time
    from nltk import word_tokenize
    
    import urllib.request
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')
    
    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            line.split()
        print ('str.split():\t', time.time() - start)
    
    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            word_tokenize(line)
        print ('word_tokenize():\t', time.time() - start)
    

    [out]:

    str.split():     0.05451083183288574
    str.split():     0.054320573806762695
    str.split():     0.05368804931640625
    str.split():     0.05416440963745117
    str.split():     0.05299568176269531
    str.split():     0.05304527282714844
    str.split():     0.05356955528259277
    str.split():     0.05473494529724121
    str.split():     0.053118228912353516
    str.split():     0.05236077308654785
    word_tokenize():     4.056122779846191
    word_tokenize():     4.052812337875366
    word_tokenize():     4.042144775390625
    word_tokenize():     4.101543664932251
    word_tokenize():     4.213029146194458
    word_tokenize():     4.411528587341309
    word_tokenize():     4.162556886672974
    word_tokenize():     4.225975036621094
    word_tokenize():     4.22914719581604
    word_tokenize():     4.203172445297241
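
Single time.time() measurements like the ones above can be noisy. A hedged alternative is the standard-library timeit module, sketched below on synthetic lines instead of the downloaded corpus (the `lines` list and the regex tokenizer are assumptions made for illustration, so the numbers it prints are not comparable to the ones above).

```python
import re
import timeit

# Synthetic stand-in for the downloaded corpus (an assumption for illustration).
lines = ["This is a foo, bar sentence."] * 1000

# Pre-compiled regex tokenizer: word runs or single punctuation marks.
regex_tokenize = re.compile(r"\w+|[^\w\s]").findall

def run_split():
    return [line.split() for line in lines]

def run_regex():
    return [regex_tokenize(line) for line in lines]

# repeat=5 gives five timings per tokenizer; the minimum is the least noisy one.
for name, func in [("str.split()", run_split), ("regex", run_regex)]:
    timings = timeit.repeat(func, number=10, repeat=5)
    print(f"{name}:\t{min(timings):.4f}s for 10 passes")
```

If NLTK is installed, word_tokenize can be added to the same loop by wrapping it in a function like `run_regex`.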
    

    If we try another tokenizer from bleeding-edge NLTK, the ToktokTokenizer, which is ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

    import time
    from nltk.tokenize import ToktokTokenizer
    
    import urllib.request
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')
    
    toktok = ToktokTokenizer().tokenize
    
    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            toktok(line)
        print ('toktok:\t', time.time() - start)
    

    [out]:

    toktok:  1.5902607440948486
    toktok:  1.5347232818603516
    toktok:  1.4993178844451904
    toktok:  1.5635688304901123
    toktok:  1.5779635906219482
    toktok:  1.8177132606506348
    toktok:  1.4538452625274658
    toktok:  1.5094449520111084
    toktok:  1.4871931076049805
    toktok:  1.4584410190582275
    

    (Note: the source of the text file is from https://github.com/Simdiva/DSL-Task)


    If we look at the native Perl implementation, the Python and Perl timings for the ToktokTokenizer are comparable. Do note that in the Python implementation the regexes are pre-compiled, while in the Perl script they aren't. Still, the proof is in the pudding:

    alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
    --2016-02-11 20:36:36--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2690 (2.6K) [text/plain]
    Saving to: ‘tok-tok.pl’
    
    100%[===============================================================================================================================>] 2,690       --.-K/s   in 0s      
    
    2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]
    
    alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
    --2016-02-11 20:36:38--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3483550 (3.3M) [text/plain]
    Saving to: ‘test.txt’
    
    100%[===============================================================================================================================>] 3,483,550    363KB/s   in 7.4s   
    
    2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]
    
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    
    real    0m1.703s
    user    0m1.693s
    sys 0m0.008s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    
    real    0m1.715s
    user    0m1.704s
    sys 0m0.008s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    
    real    0m1.700s
    user    0m1.686s
    sys 0m0.012s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    
    real    0m1.727s
    user    0m1.700s
    sys 0m0.024s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    
    real    0m1.734s
    user    0m1.724s
    sys 0m0.008s
    

    (Note: when timing tok-tok.pl, we had to redirect the output into a file, so the timing here includes the time the machine takes to write the file, whereas the nltk.tokenize.ToktokTokenizer timing doesn't include any time for writing output to a file.)
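
On the pre-compilation remark above: in Python, the difference between a pre-compiled pattern and a pattern string passed to re.sub() can be measured directly. The sketch below uses a made-up rule (`PATTERN`, padding commas and full stops with spaces) that is not taken from tok-tok. Since the re module caches compiled patterns internally, the gap is expected to be mostly the per-call cache lookup, but that is worth measuring instead of assuming.

```python
import re
import timeit

# Hypothetical rule for illustration: put spaces around commas and full stops.
PATTERN = r"([,.])"
COMPILED = re.compile(PATTERN)

line = "This is a foo, bar sentence."

def with_precompiled():
    return COMPILED.sub(r" \1 ", line)

def with_pattern_string():
    # re.sub() looks the pattern up in the module's internal cache on each call.
    return re.sub(PATTERN, r" \1 ", line)

# Both variants must produce the same text; only the timing may differ.
assert with_precompiled() == with_pattern_string()

for func in (with_precompiled, with_pattern_string):
    best = min(timeit.repeat(func, number=10000, repeat=3))
    print(f"{func.__name__}:\t{best:.4f}s")
```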


    With regard to sent_tokenize(), things are a little different: comparing speed without also considering accuracy is not very meaningful.

    Consider this:

    • If a regex "splits" a text file/paragraph into 1 sentence, then it is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...

    • If the sentences in a file are already separated by \n, then it is simply a case of comparing str.split('\n') vs re.split('\n'), and NLTK would have nothing to do with the sentence tokenization ;P
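
The second case can be sketched as follows. The `text` variable is a synthetic one-sentence-per-line string made up for illustration, and no timings are claimed here; the point is only that both calls return the same list and NLTK is not involved.

```python
import re
import timeit

# Hypothetical text with one sentence per line.
text = "\n".join(["This is a foo, bar sentence."] * 1000)

# Both approaches return exactly the same list of lines.
assert text.split('\n') == re.split('\n', text)

str_time = min(timeit.repeat(lambda: text.split('\n'), number=100, repeat=3))
re_time = min(timeit.repeat(lambda: re.split('\n', text), number=100, repeat=3))
print(f"str.split('\\n'):\t{str_time:.4f}s")
print(f"re.split('\\n'):\t{re_time:.4f}s")
```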

    For information on how sent_tokenize() works in NLTK, see:

    • training data format for nltk punkt
    • Use of PunktSentenceTokenizer in NLTK

    So to compare sent_tokenize() with other regex-based methods (not str.split('\n')) effectively, one would also have to evaluate accuracy, which requires a dataset of sentences that have been segmented and checked by humans.
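
One possible way to score such a comparison, assuming exact string matches count as correct and sentence order is ignored, is precision/recall/F1 over the sets of predicted and gold sentences. The function name `sentence_prf` and the tiny `gold`/`predicted` lists are hypothetical and only illustrate the idea.

```python
def sentence_prf(predicted, gold):
    """Precision, recall and F1 over exact sentence matches (order ignored)."""
    predicted_set = {s.strip() for s in predicted if s.strip()}
    gold_set = {s.strip() for s in gold if s.strip()}
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Made-up example: the tokenizer failed to split the last two sentences.
gold = ["No!", "We are very far from that.", "What is to be done?"]
predicted = ["No!", "We are very far from that. What is to be done?"]

precision, recall, f1 = sentence_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Using sets collapses repeated sentences into one, so a dataset with many duplicates would need a multiset-based count instead.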

    Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

    Given the text:

    In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

    We want to get this:

    In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
    Such were Willarski and even the Grand Master of the principal lodge.
    Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
    These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
    Pierre began to feel dissatisfied with what he was doing.
    Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
    He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
    And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
    What is to be done in these circumstances?
    To favor revolutions, overthrow everything, repel force by force?
    No!
    We are very far from that.
    Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
    "But what is there in running across it like that?" said Ilagin's groom.
    "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.
    

    So simply doing str.split('\n') will give you nothing, because the text contains no newlines. Even ignoring the order of the sentences, you get 0 correct sentences:

    >>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
    >>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
    ... Such were Willarski and even the Grand Master of the principal lodge.
    ... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
    ... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
    ... Pierre began to feel dissatisfied with what he was doing.
    ... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
    ... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
    ... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
    ... What is to be done in these circumstances?
    ... To favor revolutions, overthrow everything, repel force by force?
    ... No!
    ... We are very far from that.
    ... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
    ... "But what is there in running across it like that?" said Ilagin's groom.
    ... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
    >>> 
    >>> output = text.split('\n')
    >>> sum(1 for sent in output if sent in answer)
    0
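
For contrast, here is a naive regex-based splitter applied to a short excerpt assembled from the text above. The pattern is an assumption written for this excerpt only: it splits after ., ! or ? when an uppercase letter follows (even without a space), or when whitespace and an opening double quote follow. It is a sketch of the kind of regex method one might benchmark, not a replacement for sent_tokenize().

```python
import re

# Excerpt built from sentences of the task text, including its missing spaces.
excerpt = (
    "Such were Willarski and even the Grand Master of the principal lodge. "
    "And so toward the end of the year he went abroad to be initiated into "
    "the higher secrets of the order."
    "What is to be done in these circumstances? "
    "To favor revolutions, overthrow everything, repel force by force?"
    "No! We are very far from that. "
    '"But what is there in running across it like that?" said Ilagin\'s groom.'
)

# Zero-width splits rely on Python 3.7+, where re.split() accepts empty matches.
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s*(?=[A-Z])|(?<=[.!?])\s+(?=")')

sentences = SENT_BOUNDARY.split(excerpt)
for sent in sentences:
    print(sent)
```

A rule like this would wrongly split after abbreviations such as "Dr. Smith", which is exactly the kind of error an accuracy evaluation would have to count alongside speed.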
    