Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 Answers
  •  执念已碎
    2021-02-01 16:13

    I propose keeping the offsets during tokenization, as (token, offset) pairs. I think this information is useful when you later need to process the original sentence.

    import re
    from nltk.tokenize import word_tokenize
    
    def offset_tokenize(text):
        tail = text
        accum = 0
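        # word_tokenize is a module-level function, not a method;
        # each token is assumed to appear verbatim in the text
        tokens = word_tokenize(text)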
        info_tokens = []
        for tok in tokens:
            escaped_tok = re.escape(tok)
            m = re.search(escaped_tok, tail)
            start, end = m.span()
            # global offsets
            gs = accum + start
            ge = accum + end
            accum += end
            # keep searching in the rest
            tail = tail[end:]
            info_tokens.append((tok, (gs, ge)))
        return info_tokens
    
    sent = '''I've found a medicine for my disease.
    
    This is line:3.'''
    
    toks_offsets = offset_tokenize(sent)
    
    for t in toks_offsets:
        (tok, offset) = t
        # each token should equal the slice of the original text at its offsets
        print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])
    

    Gives:

    True I I
    True 've 've
    True found found
    True a a
    True medicine medicine
    True for for
    True my my
    True disease disease
    True . .
    True This This
    True is is
    True line:3 line:3
    True . .
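
To connect this back to the question, here is a minimal sketch of how (token, (start, end)) pairs like the ones above could be used to rebuild a sentence. The function name `untokenize_with_offsets` and the `fill` parameter are made up for illustration and are not part of NLTK. It assumes the offsets are sorted and non-overlapping, and it fills every gap with spaces, so newlines or tabs in the original text are not recovered exactly.

```python
def untokenize_with_offsets(info_tokens, fill=" "):
    """Rebuild a string from (token, (start, end)) pairs.

    Hypothetical helper: gaps between consecutive tokens are padded
    with `fill`, one character per missing position.
    """
    parts = []
    pos = 0  # end offset of the previous token
    for tok, (start, end) in info_tokens:
        # pad the gap between the previous token and this one
        parts.append(fill * (start - pos))
        parts.append(tok)
        pos = end
    return "".join(parts)


# Hand-written offsets for "I've found a medicine."
info_tokens = [
    ("I", (0, 1)),
    ("'ve", (1, 4)),
    ("found", (5, 10)),
    ("a", (11, 12)),
    ("medicine", (13, 21)),
    (".", (21, 22)),
]

print(untokenize_with_offsets(info_tokens))
# I've found a medicine.
```

If the original text is still available, slicing it with the first start and last end offset is simpler and also preserves the exact whitespace.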
    
