Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 Answers
  •  梦如初夏
    2021-02-01 16:12

    The reason there is no simple answer is that you actually need the span locations of the original tokens in the string. If you don't have them, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:

    1) The original string

    2) The original tokens

    3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)

    Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
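
To illustrate why the replacement order matters, here is a minimal sketch with hand-written spans. The sample text, the spans, and the `apply_edits` helper are hypothetical and not part of NLTK. Once a replacement changes the length of the string, every offset to its right goes stale, so applying the edits front to back corrupts the result, while back to front does not.

```python
def apply_edits(text, edits, back_to_front=True):
    """Apply (start, end, replacement) edits whose offsets refer to the original text."""
    ordered = sorted(edits, key=lambda e: e[0], reverse=back_to_front)
    for start, end, new in ordered:
        text = text[:start] + new + text[end:]
    return text

text = "a big dog"
edits = [(2, 5, "small"), (6, 9, "cat")]  # "big" -> "small", "dog" -> "cat"

print(apply_edits(text, edits))                       # a small cat
print(apply_edits(text, edits, back_to_front=False))  # a smalcatog (stale offsets)
```

The alternative is to adjust every later offset after each edit, which works but is easier to get wrong.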

    The example below uses TweetTokenizer, but any tokenizer should work as long as it leaves the token text unchanged, so that every token still appears verbatim in the original string.

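    import nltk
    tokenizer = nltk.tokenize.casual.TweetTokenizer()
    string = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
    tokens = tokenizer.tokenize(string)
    replacement_tokens = list(tokens)
    replacement_tokens[-3] = "cute"  # swap "horrible" for "cute"

    def detokenize(string, tokens, replacement_tokens):
        # Find each original token's (start, end) span, scanning left to right.
        spans, cursor = [], 0
        for token in tokens:
            while string[cursor:cursor + len(token)] != token and cursor < len(string):
                cursor += 1
            if cursor == len(string):
                break
            spans.append((cursor, cursor + len(token)))
            cursor += len(token)
        # Replace from back to front so the earlier spans stay valid.
        for (start, end), new in reversed(list(zip(spans, replacement_tokens))):
            string = string[:start] + new + string[end:]
        return string

    >>> detokenize(string, tokens, replacement_tokens)
    'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'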
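
If the tokenizer reports spans itself, the search loop becomes unnecessary. Some NLTK tokenizers expose a `span_tokenize` method for this; the rough sketch below avoids depending on any of them and instead assumes a simple regex tokenizer built on `re.finditer`. The `tokenize_with_spans` and `replace_tokens` helpers and the `\w+|[^\w\s]` pattern are illustrative assumptions, not NLTK's tokenization rules.

```python
import re

# Assumed token pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_with_spans(text):
    """Return (token, (start, end)) pairs; the offsets index into text."""
    return [(m.group(), m.span()) for m in TOKEN_RE.finditer(text)]

def replace_tokens(text, spans, replacements):
    """Write replacements[i] over spans[i], working from the last span back."""
    for (start, end), new in reversed(list(zip(spans, replacements))):
        text = text[:start] + new + text[end:]
    return text

text = "He found himself transformed into a horrible vermin."
pairs = tokenize_with_spans(text)
tokens = [tok for tok, _ in pairs]
spans = [span for _, span in pairs]

new_tokens = list(tokens)
new_tokens[tokens.index("horrible")] = "cute"
print(replace_tokens(text, spans, new_tokens))
# He found himself transformed into a cute vermin.
```

Because only the token spans are overwritten, the original whitespace and punctuation placement survive untouched, which is exactly what guess-based detokenizers cannot guarantee.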
    
