There are many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found
I propose keeping offsets during tokenization, so that each token becomes a pair (token, offset). This information is useful whenever you need to go back and process the original sentence.
import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    """Tokenize text and return a list of (token, (start, end)) pairs."""
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        # Note: word_tokenize rewrites double quotes as `` and '', so such
        # tokens are not found in the text and m will be None.
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens
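The same left-to-right alignment can also be written without regular expressions. The following is only a sketch under some assumptions: `align_tokens` is a hypothetical helper (not part of NLTK), it takes an already-tokenized list so it works with any tokenizer, and it records `None` for tokens that do not appear verbatim in the text instead of failing. The token list in the example is written by hand to mimic a tokenizer that rewrites double quotes.

```python
def align_tokens(tokens, text):
    """Return (token, (start, end)) pairs by scanning text left to right.

    Tokens that cannot be found verbatim get None instead of an offset.
    """
    aligned = []
    cursor = 0  # everything before this index is already matched
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            # The tokenizer changed this token; keep the cursor in place.
            aligned.append((tok, None))
            continue
        end = start + len(tok)
        aligned.append((tok, (start, end)))
        cursor = end
    return aligned


# Hand-written token list standing in for real tokenizer output.
text = 'He said "hi" twice.'
tokens = ["He", "said", "``", "hi", "''", "twice", "."]
for tok, span in align_tokens(tokens, text):
    print(tok, span)
```

Keeping a cursor instead of slicing the string means the offsets are global from the start, so no running total is needed.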
sent = '''I've found a medicine for my disease.
This is line:3.'''
toks_offsets = offset_tokenize(sent)
for tok, offset in toks_offsets:
    start, end = offset
    print(tok == sent[start:end], tok, sent[start:end])
Gives:
True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
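With the offsets available, going in the opposite direction (from tokens back to a sentence) becomes straightforward. Below is a minimal sketch, assuming the (token, (start, end)) format shown above; `detokenize` is a hypothetical name, and the offsets in the example are written by hand for the first line of the sample sentence. Tokens that touch in the original are glued together, and any gap is rendered as a single space.

```python
def detokenize(token_offsets):
    """Join tokens, inserting one space wherever the offsets leave a gap."""
    pieces = []
    prev_end = None
    for tok, (start, end) in token_offsets:
        # A gap between the previous token and this one means whitespace.
        if prev_end is not None and start > prev_end:
            pieces.append(" ")
        pieces.append(tok)
        prev_end = end
    return "".join(pieces)


sent = "I've found a medicine for my disease."
# Offsets written by hand in the (token, (start, end)) format.
toks_offsets = [
    ("I", (0, 1)), ("'ve", (1, 4)), ("found", (5, 10)), ("a", (11, 12)),
    ("medicine", (13, 21)), ("for", (22, 25)), ("my", (26, 28)),
    ("disease", (29, 36)), (".", (36, 37)),
]
print(detokenize(toks_offsets) == sent)
```

The offsets only say where whitespace was, not which kind, so a newline in the original comes back as a space. When the original string is still at hand, slicing it between the first start and the last end recovers the text exactly.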