Python Sliding Window on sentence string

渐次进展 2021-01-15 10:53

I'm looking for a sliding window splitter for a string made up of words, with window size N.

Input: "I love food and I like drink", window size 3

3 Answers
  • 2021-01-15 11:34
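    def token_sliding_window(text, size):
        # split() with no argument also handles repeated spaces, tabs and newlines
        tokens = text.split()
        # one window starts at each index that still leaves `size` words
        for i in range(len(tokens) - size + 1):
            yield tokens[i:i + size]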
    
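As a usage sketch for the generator above, the snippet below runs it on the sentence from the question and also shows what happens when the window is wider than the sentence. The variable names `sentence`, `windows` and `too_wide` are illustrative and not part of the original answer.

```python
def token_sliding_window(text, size):
    """Yield every run of `size` consecutive whitespace-separated words."""
    tokens = text.split()
    for i in range(len(tokens) - size + 1):
        yield tokens[i:i + size]


sentence = "I love food and I like drink"

# The function is a generator, so wrap it in list() to materialise the windows.
windows = list(token_sliding_window(sentence, 3))
for window in windows:
    print(window)

# A window wider than the sentence makes range() empty, so nothing is yielded.
too_wide = list(token_sliding_window(sentence, 10))
print(too_wide)
```

With 7 words and a window of 3 there are 7 - 3 + 1 = 5 windows; the oversized window gives an empty list rather than an error.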
  • 2021-01-15 11:37

    You can create iterators at different offsets and zip them all together.

    >>> arr = "I love food. blah blah".split()
    >>> its = [iter(arr), iter(arr[1:]), iter(arr[2:])]  # extend the pattern for longer windows
    >>> list(zip(*its))  # list() is needed on Python 3, where zip returns an iterator
    [('I', 'love', 'food.'), ('love', 'food.', 'blah'), ('food.', 'blah', 'blah')]
    

    You might want to use izip (Python 2 only; on Python 3 the built-in zip is already lazy) if you have long sentences, or maybe plain old loops (like in the other answer).
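For long inputs, one way to avoid building a sliced copy of the word list per offset is to keep a fixed-size buffer and advance it one item at a time. The sketch below swaps in a different technique from the answer's zip pattern: it uses `collections.deque` with `maxlen`. The function name `lazy_windows` is hypothetical, and the code assumes `size` is at least 1.

```python
from collections import deque
from itertools import islice


def lazy_windows(iterable, size):
    """Yield tuples of `size` consecutive items without slicing the input.

    Assumes size >= 1. Works on any iterable, including generators.
    """
    iterator = iter(iterable)
    # Fill the buffer with the first `size` items (fewer if the input is short).
    window = deque(islice(iterator, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in iterator:
        window.append(item)  # maxlen drops the oldest item automatically
        yield tuple(window)


words = "I love food. blah blah".split()
result = list(lazy_windows(words, 3))
print(result)
```

Only `size` items are held at any moment, so this also works when the words come from a stream rather than a list.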

  • 2021-01-15 11:46

    An approach based on subscripting the string sequence:

    def split_on_window(sequence="I love food and I like drink", limit=4):
        results = []
        split_sequence = sequence.split()
        iteration_length = len(split_sequence) - (limit - 1)
        max_window_indices = range(iteration_length)
        for index in max_window_indices:
            results.append(split_sequence[index:index + limit])
        return results
    

    Sample Output:

    >>> split_on_window("I love food and I like drink", 3)
    [['I', 'love', 'food'], ['love', 'food', 'and'], ['food', 'and', 'I'],
    ['and', 'I', 'like'], ['I', 'like', 'drink']]
    

    Here's an alternative answer inspired by @SuperSaiyan:

    try:
        from itertools import izip  # Python 2
    except ImportError:
        izip = zip  # Python 3: the built-in zip is already lazy
    
    def split_on_window(sequence, limit):
        split_sequence = sequence.split()
        iterators = [iter(split_sequence[index:]) for index in range(limit)]
        return izip(*iterators)
    

    Sample Output:

    >>> s = "I love food and I like drink"
    >>> list(split_on_window(s, 4))
    [('I', 'love', 'food', 'and'), ('love', 'food', 'and', 'I'), 
    ('food', 'and', 'I', 'like'), ('and', 'I', 'like', 'drink')]
    

    Benchmarks:

    Sequence = I love food and I like drink, limit = 3
    Repetitions = 1000000
    Using subscripting -> 3.8326420784
    Using izip -> 5.41380286217 # Modified to return a list for the benchmark.
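The benchmark harness itself is not shown above, so the following is only a sketch of how such a comparison might be reproduced on Python 3 with `timeit`. The function names `windows_by_slicing` and `windows_by_zip` are hypothetical, the repetition count is an assumption chosen to keep the run short, and timings will differ from the figures quoted above depending on machine and Python version.

```python
import timeit


def windows_by_slicing(sentence, limit):
    """Subscripting approach: slice the word list at every start index."""
    words = sentence.split()
    return [words[i:i + limit] for i in range(len(words) - limit + 1)]


def windows_by_zip(sentence, limit):
    """Zip approach: zip the word list against shifted copies of itself."""
    words = sentence.split()
    # list(...) forces evaluation so both functions do comparable work.
    return list(zip(*(words[i:] for i in range(limit))))


sentence = "I love food and I like drink"
repetitions = 10_000  # assumption: far fewer than the 1000000 used above

slicing_seconds = timeit.timeit(
    lambda: windows_by_slicing(sentence, 3), number=repetitions)
zip_seconds = timeit.timeit(
    lambda: windows_by_zip(sentence, 3), number=repetitions)

print("Using subscripting ->", slicing_seconds)
print("Using zip ->", zip_seconds)
```

Both functions return the same windows (lists in one case, tuples in the other), so the comparison measures only how the windows are built.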
    