Tokenize words in a list of sentences Python

Asked by 广开言路, 2021-02-04 06:41

I currently have a file that contains a list that looks like:

example = ['Mary had a little lamb' ,
           'Jack went up the hill' ,
           'Ji


        
7 Answers
  •  感情败类
    2021-02-04 07:11

    You can use nltk (as @alvas suggests) together with a recursive function that takes any object and tokenizes each str inside it:

    from nltk.tokenize import word_tokenize
    def tokenize(obj):
        if obj is None:
            return None
        elif isinstance(obj, str): # basestring in python 2.7
            return word_tokenize(obj)
        elif isinstance(obj, list):
            return [tokenize(i) for i in obj]
        else:
            return obj # Or throw an exception, or parse a dict...
    

    Usage:

    data = [["Lorem ipsum dolor. Sit amet?", "Hello World!", None], ["a"], "Hi!", None, ""]
    print(tokenize(data))
    

    Output:

    [[['Lorem', 'ipsum', 'dolor', '.', 'Sit', 'amet', '?'], ['Hello', 'World', '!'], None], [['a']], ['Hi', '!'], None, []]
    
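    The `else` branch above hints that other types, such as dicts, could be handled too. A minimal sketch of that extension (the regex tokenizer here is a dependency-free stand-in for nltk's `word_tokenize`, used only so the example runs without downloading nltk data):

    ```python
    import re

    # Stand-in tokenizer: splits into word runs and single punctuation marks.
    # In practice you would use nltk.tokenize.word_tokenize instead.
    def word_tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)

    def tokenize(obj):
        if obj is None:
            return None
        elif isinstance(obj, str):
            return word_tokenize(obj)
        elif isinstance(obj, list):
            return [tokenize(i) for i in obj]
        elif isinstance(obj, dict):
            # Extension hinted at in the answer: tokenize values, keep keys as-is
            return {k: tokenize(v) for k, v in obj.items()}
        else:
            return obj

    example = ['Mary had a little lamb', 'Jack went up the hill']
    print(tokenize(example))
    # → [['Mary', 'had', 'a', 'little', 'lamb'], ['Jack', 'went', 'up', 'the', 'hill']]
    ```

    Tokenizing the values (but not the keys) of a dict is one reasonable choice; depending on your data you might instead raise a TypeError for unexpected types.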
