I currently have a file that contains a list that looks like:

example = ['Mary had a little lamb',
           'Jack went up the hill',
           'Ji
You can use nltk (as @alvas suggests) together with a recursive function that takes any object and tokenizes every str inside it:
from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):  # use basestring in Python 2.7
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj  # or raise an exception, or handle a dict...
Usage:
data = [["Lorem ipsum dolor. Sit amet?", "Hello World!", None], ["a"], "Hi!", None, ""]
print(tokenize(data))
Output:
[[['Lorem', 'ipsum', 'dolor', '.', 'Sit', 'amet', '?'], ['Hello', 'World', '!'], None], [['a']], ['Hi', '!'], None, []]
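If your data can also contain dicts, the same recursive pattern extends naturally with one more branch. Here is a minimal sketch; it uses a simple regex tokenizer (`simple_tokenize`, my own stand-in) so it runs without NLTK's Punkt data download, but in real use you would call `word_tokenize` in its place:

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize: matches runs of word
    # characters, or any single non-space punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return simple_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    elif isinstance(obj, dict):
        # Tokenize the values, leave the keys untouched.
        return {k: tokenize(v) for k, v in obj.items()}
    else:
        return obj

print(tokenize({"title": "Hello World!", "tags": ["a b", None]}))
# {'title': ['Hello', 'World', '!'], 'tags': [['a', 'b'], None]}
```

Tokenizing only the values (not the keys) is a choice: keys are usually identifiers rather than natural-language text, but you could recurse into them too if your data calls for it.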