Python Untokenize a sentence

名媛妹妹 2021-02-01 15:46

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found


        
10 Answers
  • 2021-02-01 16:00

    For me, it worked with NLTK 3.2.5. The nltk.tokenize.moses module was removed in later NLTK releases, so pin the version instead of upgrading to the latest:

    pip install nltk==3.2.5
    

    then,

    import nltk
    nltk.download('perluniprops')
    
    from nltk.tokenize.moses import MosesDetokenizer
    detokenizer = MosesDetokenizer()
    

    If your tokens are in a pandas DataFrame column, then

    df['detoken']=df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
    
  • 2021-02-01 16:02

    I use the following code for detokenization, without any major library functions. It only handles a specific set of punctuation tokens.

    _SPLITTER_ = r"([-.,/:!?\";)(])"  # not used by the function below
    
    def basic_detokenizer(sentence):
        """Basic detokenizer that undoes the spacing our tokenizer added around punctuation."""
        detokenize_sentence = []
        words = sentence.split()  # split() avoids empty tokens from repeated spaces
        pos = 0
        while pos < len(words):
            if words[pos] in ('-', '/', '.') and 0 < pos < len(words) - 1:
                # Infix punctuation: glue it to both neighbours ("3 . 5" -> "3.5")
                left = detokenize_sentence.pop()
                detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
                pos += 1
            elif words[pos] in ('[', '(') and pos < len(words) - 1:
                # Opening bracket: attach it to the following word
                detokenize_sentence.append(''.join(words[pos:pos + 2]))
                pos += 1
            elif words[pos] in (']', ')', '.', ',', ':', '!', '?', ';') and pos > 0:
                # Closing punctuation: attach it to the previous word
                left = detokenize_sentence.pop()
                detokenize_sentence.append(left + words[pos])
            else:
                detokenize_sentence.append(words[pos])
            pos += 1
        return ' '.join(detokenize_sentence)
    
  • 2021-02-01 16:09
    from nltk.tokenize.treebank import TreebankWordDetokenizer
    TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
    # 'the quick brown'
    
  • 2021-02-01 16:09

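    Use the join function:

    You could just do a ' '.join(words) to get a string back. Note that it matches the original only up to spacing: punctuation and contractions stay separated (for example, "disease ." instead of "disease.").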
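
    To make the spacing issue concrete, here is a minimal sketch that joins the tokens and then tidies the result with a few regular expressions. The helper name join_and_tidy and its three rules are assumptions for illustration, not part of NLTK, and they only cover common English punctuation and clitics.

```python
import re

def join_and_tidy(tokens):
    """Join tokens with spaces, then remove spaces a tokenizer typically adds."""
    text = " ".join(tokens)
    # Drop the space before closing punctuation: "disease ." -> "disease."
    text = re.sub(r"\s+([.,!?;:)\]])", r"\1", text)
    # Drop the space after opening brackets: "( see" -> "(see"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Re-attach clitics such as 've, 's, n't: "I 've" -> "I've"
    text = re.sub(r"\s+(n't|'\w+)", r"\1", text)
    return text

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(" ".join(tokens))       # I 've found a medicine for my disease .
print(join_and_tidy(tokens))  # I've found a medicine for my disease.
```

    The clitic rule would also swallow the space before an opening single quote (as in 'hello'), so this is a rough heuristic rather than a faithful inverse of the tokenizer.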

  • 2021-02-01 16:12

    To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

    Short of doing crazy hacks on nltk, you can try this:

    >>> import nltk
    >>> import string
    >>> nltk.word_tokenize("I've found a medicine for my disease.")
    ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    >>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
    >>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
    "I've found a medicine for my disease."
    
  • 2021-02-01 16:12

    The reason there is no simple answer is you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:

    1) The original string

    2) The original tokens

    3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)

    Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.

    Here I'm using TweetTokenizer, but any tokenizer should work as long as it doesn't alter the token text, i.e. every token it returns still appears verbatim in the original string.

    import nltk

    tokenizer=nltk.tokenize.casual.TweetTokenizer()
    string="One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
    tokens=tokenizer.tokenize(string)
    replacement_tokens=list(tokens)
    replacement_tokens[-3]="cute"
    
    def detokenize(string,tokens,replacement_tokens):
        # Find the (start, end) span of each token, scanning left to right
        spans=[]
        cursor=0
        for token in tokens:
            while cursor<len(string) and string[cursor:cursor+len(token)]!=token:
                cursor+=1
            if cursor==len(string):break
            newcursor=cursor+len(token)
            spans.append((cursor,newcursor))
            cursor=newcursor
        # Replace from back to front so the earlier spans stay valid;
        # zip keeps each span paired with its own replacement even if the scan stopped early
        for (start,end),replacement in reversed(list(zip(spans,replacement_tokens))):
            string=string[:start]+replacement+string[end:]
        return string
    
    >>> detokenize(string,tokens,replacement_tokens)
    'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
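
    The same span idea can be sketched without NLTK. The version below is only an illustration under stated assumptions: find_spans and replace_tokens are hypothetical helper names, the token list is written by hand instead of coming from a tokenizer, and every token is assumed to appear verbatim and in order in the text. It uses str.find to locate the spans and shows that irregular spacing survives even when a replacement has a different length.

```python
def find_spans(text, tokens):
    """Locate each token in text, left to right; return (start, end) pairs."""
    spans, cursor = [], 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        spans.append((start, start + len(token)))
        cursor = start + len(token)
    return spans

def replace_tokens(text, tokens, replacements):
    """Swap each token for its replacement, keeping the original spacing."""
    spans = find_spans(text, tokens)
    # Work right to left so earlier offsets stay valid when lengths differ.
    for (start, end), new in reversed(list(zip(spans, replacements))):
        text = text[:start] + new + text[end:]
    return text

# Hand-written tokens (an assumption; a real tokenizer may split differently).
text = "He  said: don't   panic!"
tokens = ["He", "said", ":", "don't", "panic", "!"]
replacements = ["She", "said", ":", "do", "panic", "!"]
print(replace_tokens(text, tokens, replacements))
# She  said: do   panic!
```

    Raising an error on a missing token, instead of silently stopping, makes it obvious when the tokenizer has changed the token text.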
    