Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 answers
  •  情深已故
    2021-02-01 16:19

    Use the untokenize function from token_utils:

    import re
    def untokenize(words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
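        # Join the tokens with single spaces, then undo the tokenizer's spacing step by step.
        text = ' '.join(words)
        # Treebank-style quote tokens (`` and '') become plain double quotes; rejoin ellipses.
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
        # Attach parentheses to the text they enclose.
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        # Remove the space before punctuation that is followed by a space, quote, or backtick...
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        # ...and before punctuation at the very end of the string.
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        # Reattach contractions and clitics ('ve, 's, n't) to the preceding word.
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
             "can not", "cannot")
        # Turn a stray backtick token into an opening single quote.
        step6 = step5.replace(" ` ", " '")
        return step6.strip()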
    
    tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    untokenize(tokenized)
    # "I've found a medicine for my disease."
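The two regular-expression steps in the function above do most of the punctuation cleanup, and it may help to see them in isolation. The sketch below is only an illustration: `close_up_punctuation` is a hypothetical helper name, not something provided by token_utils or NLTK, and it copies just the two `re.sub` patterns from the answer.

```python
import re

# Hypothetical helper name, not part of token_utils: it isolates the two
# re.sub steps (step3 and step4) from the answer's untokenize function.
def close_up_punctuation(text):
    # Punctuation followed by a space, quote, or backtick: drop the space before it.
    text = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", text)
    # Punctuation at the very end of the string has no following character,
    # so the first pattern cannot match it; this second pattern covers that case.
    return re.sub(r' ([.,:;?!%]+)$', r"\1", text)

print(close_up_punctuation("Hello , world !"))  # Hello, world!
print(close_up_punctuation("Wait ... what ?"))  # Wait... what?
```

Two patterns are used because the first one requires a character after the punctuation, so a sentence-final period or question mark would otherwise keep its leading space.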
    
