Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 Answers
  •  孤独总比滥情好
    2021-02-01 16:02

    I use the following code for detokenization, without any major library functions. It only rejoins a few specific punctuation tokens.

    # Punctuation pattern kept from the original answer; the detokenizer below does not use it.
    _SPLITTER_ = r"([-.,/:!?\";)(])"

    def basic_detokenizer(sentence):
        """Rejoin punctuation tokens that the tokenizer separated with spaces."""
        detokenized = []
        words = sentence.split(' ')
        pos = 0
        while pos < len(words):
            word = words[pos]
            # Tuples give exact matches; `word in '-/.'` would also match '' and '/.'.
            if word in ('-', '/', '.') and 0 < pos < len(words) - 1:
                # Infix token: glue it to both neighbours ("a - b" -> "a-b").
                # Note that a mid-sentence "." is also joined this way.
                left = detokenized.pop()
                detokenized.append(left + word + words[pos + 1])
                pos += 1
            elif word in ('[', '(') and pos < len(words) - 1:
                # Opening bracket: attach it to the following word.
                detokenized.append(word + words[pos + 1])
                pos += 1
            elif word in (']', ')', '.', ',', ':', '!', '?', ';') and pos > 0:
                # Closing bracket or trailing punctuation: attach it to the previous word.
                left = detokenized.pop()
                detokenized.append(left + word)
            else:
                detokenized.append(word)
            pos += 1
        return ' '.join(detokenized)
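
    Since the question starts from a list of tokens rather than a string, here is a minimal regex-based sketch of the same idea that accepts a token list. It is not the code above: the function name `detokenize_tokens`, the contraction rule, and the hand-written token list are all assumptions made for illustration, and it does not handle infix tokens such as "-" or "/".

```python
import re

def detokenize_tokens(tokens):
    """Join a list of tokens and tidy the spacing around punctuation.

    Hypothetical helper for illustration; the rules are assumptions, not
    an exact inverse of any particular tokenizer.
    """
    text = " ".join(tokens)
    # Reattach contraction pieces such as "'ve" or "n't" to the previous word.
    text = re.sub(r"\s+(n't|'\w+)", r"\1", text)
    # Remove the space before closing brackets and trailing punctuation.
    text = re.sub(r"\s+([\]).,:!?;])", r"\1", text)
    # Remove the space after opening brackets.
    text = re.sub(r"([\[(])\s+", r"\1", text)
    return text

# Hand-written tokens, assumed to resemble Treebank-style tokenizer output.
tokens = ["I", "'ve", "found", "a", "way", "(", "I", "think", ")", ",",
          "did", "n't", "I", "?"]
print(detokenize_tokens(tokens))  # I've found a way (I think), didn't I?
```

    One known limitation of this sketch: a token that opens a single-quoted phrase (for example "'hello") would be glued to the word before it, so quoted text needs separate handling.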
    
