Java NLP: Extracting Indices When Tokenizing Text
Question

When tokenizing a string of text, I need to extract the index of each token in the original string. For example, given:

"Mary didn't kiss John"

I would need something like:

[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

where 0, 5, 8, 12 and 17 are the indexes (in the original string) at which each token begins. I cannot rely on whitespace alone, since some words become two tokens. Nor can I simply search for each token in the string, since a word will likely appear multiple times. One