Python Untokenize a sentence

后端未结

关注

 10  977

There are so many guides on how to tokenize a sentence, but i didn\'t find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize(\"I\'ve found


                      
              相关标签:


      
      
        
          10条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  执念已碎        
                
              
                            
                2021-02-01 16:13
              
            
            
                                                                       
I propose to keep offsets in tokenization: (token, offset).
I think, this information is useful for processing over the original sentence.

import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = self.tokenize(text)
    info_tokens = []
    for tok in tokens:
        scaped_tok = re.escape(tok)
        m = re.search(scaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_token

sent = '''I've found a medicine for my disease.

This is line:3.'''

toks_offsets = offset_tokenize(sent)

for t in toks_offsets:
(tok, offset) = t
print (tok == sent[offset[0]:offset[1]]), tok, sent[offset[0]:offset[1]]


Gives:

True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  攒了一身酷        
                
              
                            
                2021-02-01 16:14
              
            
            
                                                                       
You can use "treebank detokenizer" - TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'




There is also MosesDetokenizer which was in nltk but got removed because of the licensing issues, but it is available as a Sacremoses standalone package.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  北海茫月        
                
              
                            
                2021-02-01 16:18
              
            
            
                                                                       
The reason tokenize.untokenize does not work is because it needs more information than just the words. Here is an example program using tokenize.untokenize:

from StringIO import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)




Additional Help:
Tokenize - Python Docs |
Potential Problem
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情深已故        
                
              
                            
                2021-02-01 16:19
              
            
            
                                                                       
use token_utils.untokenize from here 

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

 tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my','disease', '.']
 untokenize(tokenized)
 "I've found a medicine for my disease."

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复