Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 Answers
  • 2021-02-01 16:13

    I propose keeping the offsets during tokenization, as (token, offset) pairs. I think this information is useful when you need to process the original sentence.

    import re
    from nltk.tokenize import word_tokenize
    
    def offset_tokenize(text):
        """Return (token, (start, end)) pairs with offsets into `text`."""
        tail = text
        accum = 0
        tokens = word_tokenize(text)
        info_tokens = []
        for tok in tokens:
            # assumes every token appears verbatim in the text
            escaped_tok = re.escape(tok)
            m = re.search(escaped_tok, tail)
            start, end = m.span()
            # global offsets
            gs = accum + start
            ge = accum + end
            accum += end
            # keep searching in the rest
            tail = tail[end:]
            info_tokens.append((tok, (gs, ge)))
        return info_tokens
    
    sent = '''I've found a medicine for my disease.
    
    This is line:3.'''
    
    toks_offsets = offset_tokenize(sent)
    
    for tok, offset in toks_offsets:
        print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])
    

    Gives:

    True I I
    True 've 've
    True found found
    True a a
    True medicine medicine
    True for for
    True my my
    True disease disease
    True . .
    True This This
    True is is
    True line:3 line:3
    True . .
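
    One use of these offsets is putting the sentence back together. The following is only a sketch and not part of the answer above: `rebuild_from_offsets` is a hypothetical helper, and the sample pairs are written by hand so the snippet runs without NLTK. It assumes the pairs are sorted and non-overlapping. Without the original text, gaps between tokens are filled with spaces; with it, the exact whitespace (including line breaks) is copied back.

```python
def rebuild_from_offsets(tokens_with_offsets, original=None):
    """Rebuild text from (token, (start, end)) pairs (hypothetical helper)."""
    pieces = []
    cursor = 0
    for tok, (start, end) in tokens_with_offsets:
        if original is not None:
            # copy whatever sat between the previous token and this one
            gap = original[cursor:start]
        else:
            # no original available: approximate the gap with spaces
            gap = " " * (start - cursor)
        pieces.append(gap)
        pieces.append(tok)
        cursor = end
    if original is not None:
        # keep any trailing text after the last token
        pieces.append(original[cursor:])
    return "".join(pieces)


# Hand-written sample in the same shape as the output of offset_tokenize
sample = [("I", (0, 1)), ("'ve", (1, 4)), ("found", (5, 10)),
          ("it", (11, 13)), (".", (13, 14))]
print(rebuild_from_offsets(sample))  # I've found it.
```

    Passing `original=sent` together with the pairs from `offset_tokenize(sent)` should give back `sent` unchanged, since every gap is copied from the source text.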
    
  • 2021-02-01 16:14

    You can use NLTK's Treebank detokenizer, TreebankWordDetokenizer:

    from nltk.tokenize.treebank import TreebankWordDetokenizer
    TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
    # 'the quick brown'
    

    There is also MosesDetokenizer, which used to be part of NLTK but was removed because of licensing issues. It is now available in the standalone Sacremoses package.

  • 2021-02-01 16:18

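    tokenize.untokenize does not work on a plain list of words because it needs more information than just the words: it expects the token stream produced by Python's own tokenize module, where every token carries at least a type and a string. Here is an example program using tokenize.untokenize: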

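    from io import StringIO  # Python 3; in Python 2 this was `from StringIO import StringIO`
    import tokenize
    
    sentence = "I've found a medicine for my disease.\n"
    # tokenize is Python's source-code tokenizer, so it reads lines via a readline callable.
    # The apostrophe is not valid Python source: depending on the Python version
    # it may come back as an ERRORTOKEN or raise tokenize.TokenError.
    tokens = tokenize.generate_tokens(StringIO(sentence).readline)
    print(tokenize.untokenize(tokens))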
    


    Additional Help: Tokenize - Python Docs | Potential Problem
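
    To make the "more information than just the words" point concrete, here is a small sketch using only the standard library. The input is a line of valid Python source (an assumption made for the example, since `tokenize` is a tokenizer for Python code, not for English). With full tokens, which carry positions, the round trip is exact; with only `(type, string)` pairs the spacing has to be guessed.

```python
import io
import tokenize

source = "x = foo(1, 2)\n"  # valid Python source, unlike an English sentence

# Full tokens carry type, string, and start/end positions.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
exact = tokenize.untokenize(tokens)

# With only (type, string) pairs the positions are gone, so spacing is guessed.
pairs = [(tok.type, tok.string) for tok in tokens]
approximate = tokenize.untokenize(pairs)

print(repr(exact))        # same text as `source`
print(repr(approximate))  # same tokens, but the spacing may differ
```

    A bare list of words such as `['I', "'ve", 'found']` has neither token types nor positions, which is why it cannot be passed to `tokenize.untokenize` directly.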

  • 2021-02-01 16:19

    Use token_utils.untokenize; the function is reproduced here:

    import re
    def untokenize(words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
             "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()
    
    tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    untokenize(tokenized)
    # "I've found a medicine for my disease."
    
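    The same join-then-substitute idea can also be written as an ordered table of regex rules, which makes it easier to add or reorder rules. This is only a sketch: `RULES` and `untokenize_with_rules` are hypothetical names, and the rule set is a reduced, English-only illustration, not an equivalent of the function above.

```python
import re

# Ordered (pattern, replacement) pairs; order matters because later rules
# see the output of earlier ones. Hypothetical, English-only rule set.
RULES = [
    (re.compile(r"`` "), '"'),              # opening Treebank quote
    (re.compile(r" ''"), '"'),              # closing Treebank quote
    (re.compile(r" (n't|'\w+)"), r"\1"),    # contractions: do n't, I 've
    (re.compile(r" ([.,:;?!%]+)"), r"\1"),  # punctuation attaches to the left
    (re.compile(r"\( "), "("),              # no space after an opening paren
    (re.compile(r" \)"), ")"),              # no space before a closing paren
]


def untokenize_with_rules(words):
    """Join tokens with spaces, then apply each rule in order."""
    text = " ".join(words)
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


print(untokenize_with_rules(
    ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']))
# I've found a medicine for my disease.
```

    Like any rule-based untokenizer, this guesses at spacing, so it will not restore the original text exactly for every input.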