Splitting a list of sentences into separate words in a list

失恋的感觉 2021-01-16 18:33

I have a list that consists of lines such as:

lines = ['The query complexity of estimating weighted averages.',
     'New bounds for the query complexity of a


        
4 answers
  • 2021-01-16 19:05

    You can do it with NLTK's word_tokenize:

    import nltk
    nltk.download('punkt')    # tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
    from nltk.tokenize import word_tokenize
    
    lines =  ['The query complexity of estimating weighted averages.',
     'New bounds for the query complexity of an algorithm that learns',
     'DFAs with correction equivalence queries.',
     'general procedure to check conjunctive query containment.']
    
    joint_words = ' '.join(lines)
    
    separated_words = word_tokenize(joint_words)
    
    print(separated_words)
    

    Output will be:

    ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages', '.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries', '.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment', '.']
    

    If you also want to attach each period to the preceding word (the periods appear as separate strings in the list), build a new list rather than deleting items from the list while iterating over it:

    merged_words = []
    for token in separated_words:
        if token == '.' and merged_words:
            merged_words[-1] += token    # attach the period to the previous word
        else:
            merged_words.append(token)
    
    print(merged_words)
    

    Output will be:

    ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment.']
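
    If installing NLTK is not an option, a rough approximation of the tokenizer output above can be sketched with the standard library's re module. The pattern is an assumption chosen for illustration: it treats runs of word characters as words and every other non-space character as its own token, so it will not match word_tokenize on contractions, abbreviations, or hyphenated words.

```python
import re

lines = ['The query complexity of estimating weighted averages.',
         'DFAs with correction equivalence queries.']

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
tokens = re.findall(r"\w+|[^\w\s]", ' '.join(lines))

print(tokens)
```

    With this pattern each period comes out as a separate token, as in the first output above.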
    
  • 2021-01-16 19:06

    It sounds like you want something like this, which splits a string on whitespace:

    lines[0].split()
    

    The above splits the first string in your lines list (which seems to contain only one item) on whitespace.
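
    To make the difference concrete, here is a small sketch. The two-item sample list is borrowed from the other answers, not from the truncated question. lines[0].split() returns the words of the first string only, while calling split() on every item gives one word list per line.

```python
lines = ['The query complexity of estimating weighted averages.',
         'DFAs with correction equivalence queries.']

first_only = lines[0].split()                # words of the first string only
per_line = [line.split() for line in lines]  # one list of words per line

print(first_only)
print(per_line)
```

    The per-line result is a list of lists, so it still needs flattening if a single list of words is the goal.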

  • 2021-01-16 19:07

    You can join all lines and then use split():

    " ".join(lines).split()
    

    or you can split each line and chain:

    from itertools import chain
    list(chain(*map(str.split, lines)))
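
    As a quick check that the two forms agree, the sketch below runs both on a short sample list (the sample data is an assumption). It uses chain.from_iterable, which takes the iterable of lists directly instead of unpacking it with *; that variant is a substitution for illustration, not the exact call shown above.

```python
from itertools import chain

lines = ['The query complexity of estimating weighted averages.',
         'DFAs with correction equivalence queries.']

# Approach 1: join everything into one string, then split on whitespace.
joined = " ".join(lines).split()

# Approach 2: split each line, then flatten the resulting lists lazily.
chained = list(chain.from_iterable(map(str.split, lines)))

print(joined == chained)
```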
    
  • 2021-01-16 19:11

    You can use a list comprehension:

    >>> lines =  [
    ...     'The query complexity of estimating weighted averages.',
    ...     'New bounds for the query complexity of an algorithm that learns',
    ... ]
    >>> [word for line in lines for word in line.split()]
    ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns']
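
    Note that split() leaves punctuation attached ('averages.' above). If bare words are wanted, one possible follow-up is to strip leading and trailing punctuation from each word with str.strip and string.punctuation. This is a sketch based on an assumption about the desired output, not something the question asked for.

```python
import string

lines = [
    'The query complexity of estimating weighted averages.',
    'New bounds for the query complexity of an algorithm that learns',
]

# strip() only removes punctuation at the ends of a word, so inner characters
# such as the hyphen in "well-known" are left alone.
words = [word.strip(string.punctuation) for line in lines for word in line.split()]

print(words[:7])
```

    A token made up entirely of punctuation would become an empty string here, so filter those out if your data contains them.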
    