There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
# ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

How do I put these tokens back together into the original sentence?
For me, it worked when I installed NLTK 3.2.5:
pip install -U nltk
then,
import nltk
nltk.download('perluniprops')
from nltk.tokenize.moses import MosesDetokenizer
If you are using it inside a pandas DataFrame, then:

detokenizer = MosesDetokenizer()
df['detoken'] = df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
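For reference, a minimal end-to-end sketch (note that newer NLTK releases removed the Moses tools for licensing reasons; they now live in the separate sacremoses package, where the import is from sacremoses import MosesDetokenizer):

detokenizer = MosesDetokenizer()
tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
detokenizer.detokenize(tokens, return_str=True)
# expected: "I've found a medicine for my disease."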
I am using the following code, without any major library function, for detokenization. I use it for some specific tokens.
_SPLITTER_ = r'([-.,/:!?";)(])'  # punctuation split off by the corresponding tokenizer

def basic_detokenizer(sentence):
    """Basic detokenizer that resolves the spacing issues created by our tokenizer."""
    detokenize_sentence = []
    words = sentence.split(' ')
    pos = 0
    while pos < len(words):
        if words[pos] in '-/.' and 0 < pos < len(words) - 1:
            # glue infix punctuation to both neighbours: "state - of" -> "state-of"
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in '[(' and pos < len(words) - 1:
            # glue opening brackets to the following word
            detokenize_sentence.append(''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in ']).,:!?;' and pos > 0:
            # glue closing brackets and trailing punctuation to the previous word
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + words[pos])
        else:
            detokenize_sentence.append(words[pos])
        pos += 1
    return ' '.join(detokenize_sentence)
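For example, tracing the function above on space-joined tokens:

basic_detokenizer('Hello , world ( test ) !')
# 'Hello, world (test)!'
basic_detokenizer('state - of - the - art')
# 'state-of-the-art'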
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
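It also round-trips the sentence from the question, since TreebankWordDetokenizer reverses the Penn Treebank conventions that nltk.word_tokenize applies (a quick sketch):

from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
tokens = word_tokenize("I've found a medicine for my disease.")
TreebankWordDetokenizer().detokenize(tokens)
# "I've found a medicine for my disease."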
Use the join function:

You could just do a ' '.join(words) to get back something very close to the original string.
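Keep in mind that with NLTK's word_tokenize this is only an approximation, because punctuation and clitics were split into tokens of their own:

import nltk
tokens = nltk.word_tokenize("I've found a medicine for my disease.")
' '.join(tokens)
# "I 've found a medicine for my disease ."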
To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.
Short of doing crazy hacks on nltk, you can try this:
>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
The reason there is no simple answer is that you actually need the span locations of the original tokens in the string. If you don't have those, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:
1) The original string
2) The original tokens
3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)
Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
Here I'm using TweetTokenizer, but it shouldn't matter as long as the tokenizer you use doesn't alter the values of your tokens so that they no longer appear verbatim in the original string.
import nltk

tokenizer = nltk.tokenize.casual.TweetTokenizer()
string = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens = tokenizer.tokenize(string)
replacement_tokens = list(tokens)
replacement_tokens[-3] = "cute"
def detokenize(string, tokens, replacement_tokens):
    # Pass 1: scan left to right, recording the (start, end) span of each
    # token as it occurs in the original string.
    spans = []
    cursor = 0
    for token in tokens:
        while string[cursor:cursor + len(token)] != token and cursor < len(string):
            cursor += 1
        if cursor == len(string):
            break
        newcursor = cursor + len(token)
        spans.append((cursor, newcursor))
        cursor = newcursor
    # Pass 2: splice in the replacement tokens from back to front, so the
    # earlier spans stay valid as the string changes length.
    i = len(tokens) - 1
    for start, end in spans[::-1]:
        string = string[:start] + replacement_tokens[i] + string[end:]
        i -= 1
    return string
>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
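If your tokenizer can report spans itself, the scanning pass is unnecessary. As a sketch, assuming every token occurs verbatim in the original string, nltk.tokenize.util.align_tokens can recover the spans (and some NLTK tokenizers implement span_tokenize directly):

from nltk.tokenize.util import align_tokens
spans = align_tokens(tokens, string)  # [(start, end), ...] offsets into string
# Splice replacements back to front so earlier spans stay valid.
for (start, end), replacement in zip(spans[::-1], replacement_tokens[::-1]):
    string = string[:start] + replacement + string[end:]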