I match words in a text to retrieve each word's begin and end offsets. This normally works for both ASCII and Unicode texts when using an appropriate Unicode-aware regex l