There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found
I propose keeping the offsets when tokenizing, as (token, (start, end)) pairs. I think this information is useful for later processing against the original sentence.
import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    """Tokenize text and return (token, (start, end)) pairs."""
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        # m is None if the tokenizer altered the token
        # (e.g. NLTK rewrites double quotes as `` and '')
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens
sent = '''I've found a medicine for my disease.
This is line:3.'''
toks_offsets = offset_tokenize(sent)
for tok, (start, end) in toks_offsets:
    print(tok == sent[start:end], tok, sent[start:end])
Gives:
True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
You can use the "treebank detokenizer", TreebankWordDetokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
There is also MosesDetokenizer, which used to be in NLTK but was removed because of licensing issues. It is now available in the standalone Sacremoses package.
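The reason tokenize.untokenize does not work is that it needs more information than just the words. It belongs to Python's standard tokenize module, which tokenizes Python source code, and it expects the token tuples that module produces rather than a list of plain strings. Here is an example program using tokenize.untokenize: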
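from io import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
# generate_tokens expects Python source. An English sentence is not valid Python,
# so depending on the Python version this may yield error tokens or raise tokenize.TokenError.
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print(tokenize.untokenize(tokens))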
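For contrast, here is a minimal sketch of what tokenize.untokenize is designed for: round-tripping Python source code. The `source` snippet is an illustrative assumption. The documented guarantee is that the result tokenizes back to the same token types and strings. Spacing is only preserved when the full token tuples (with positions) are passed in.

```python
import io
import tokenize

source = "x = (1 + 2) * 3\n"  # hypothetical Python snippet, not natural language

# Full token tuples carry type, string, start/end positions and the source line.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
restored = tokenize.untokenize(tokens)

# Re-tokenize the result and compare (type, string) pairs with the original.
retokenized = list(tokenize.generate_tokens(io.StringIO(restored).readline))
same_tokens = [(t.type, t.string) for t in tokens] == [
    (t.type, t.string) for t in retokenized
]
print(same_tokens)
```

This is why a list of NLTK word tokens is not enough: without token types and positions, untokenize has nothing to reconstruct the spacing from.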
Additional help: Tokenize - Python Docs | Potential Problem
Use the untokenize function from token_utils:
import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()
tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."