Python: find all possible word combinations with a sequence of characters (word segmentation)

被撕碎了的回忆 2021-01-14 05:39

I'm doing some word segmentation experiments like the following.

lst is a sequence of characters, and output should be all the possible ways of splitting it into words.

4 answers
  • 2021-01-14 06:16
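    #!/usr/bin/env python
    from itertools import combinations
    a = ['a', 'b', 'c', 'd']
    a = "".join(a)
    # Collect every set of cut positions: choose i of the len(a)-1 gaps between characters.
    cuts = []
    for i in range(0, len(a)):
        cuts.extend(combinations(range(1, len(a)), i))
    # Slice the string at each set of cut positions.
    for i in cuts:
        last = 0
        output = []
        for j in i:
            output.append(a[last:j])
            last = j
        output.append(a[last:])  # remainder after the final cut
        print(output)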
    

    output:

    zsh 2419 % ./words.py  
    ['abcd']
    ['a', 'bcd']
    ['ab', 'cd']
    ['abc', 'd']
    ['a', 'b', 'cd']
    ['a', 'bc', 'd']
    ['ab', 'c', 'd']
    ['a', 'b', 'c', 'd']
    
  • 2021-01-14 06:25

    itertools.product should indeed be able to help you.

    The idea is this: consider the elements A1, A2, ..., AN with a possible slab (divider) between each adjacent pair, which gives N-1 slab positions. Where a slab is present, the sequence is split; where it is absent, the two neighbours are joined. Each position is an independent yes/no choice, so a sequence of length N has 2^(N-1) such combinations.

    For example:

    import itertools
    lst = ['a', 'b', 'c', 'd']
    combinatorics = itertools.product([True, False], repeat=len(lst) - 1)
    
    solution = []
    for combination in combinatorics:
        i = 0
        one_such_combination = [lst[i]]
        for slab in combination:
            i += 1
            if not slab: # there is a join
                one_such_combination[-1] += lst[i]
            else:
                one_such_combination += [lst[i]]
        solution.append(one_such_combination)
    
    print(solution)
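
As a quick sanity check on the 2^(N-1) count, here is a minimal sketch that wraps the same slab idea in a function and counts the results for a few lengths. The function name segment is a hypothetical helper, not part of the answer above, and it assumes a non-empty list of strings.

```python
import itertools

def segment(lst):
    """Return every segmentation of lst (hypothetical helper; lst must be non-empty)."""
    solution = []
    # One True/False choice per gap between neighbouring elements.
    for combination in itertools.product([True, False], repeat=len(lst) - 1):
        words = [lst[0]]
        for slab, item in zip(combination, lst[1:]):
            if slab:
                words.append(item)   # slab present: start a new word
            else:
                words[-1] += item    # no slab: join onto the current word
        solution.append(words)
    return solution

for n in range(1, 6):
    lst = list("abcde"[:n])
    result = segment(lst)
    # Every segmentation must join back to the original sequence.
    assert all("".join(words) == "".join(lst) for words in result)
    print(n, len(result), 2 ** (n - 1))
```

The last two numbers on each printed line should agree if the count holds.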
    
  • 2021-01-14 06:36

    For four letters there are 8 options, one for each 3-digit binary number from 0 through 7:

    000
    001
    010
    011
    100
    101
    110
    111
    

    Each digit says whether or not the two adjacent letters at that position are "glued" together: 0 for no, 1 for yes.

    >>> lst = ['a', 'b', 'c', 'd']
    ... output = []
    ... width = len(lst) - 1  # one binary digit per gap between letters
    ... for i in range(2**width):
    ...     output.append([])
    ...     s = format(i, "b").zfill(width)  # zero-padded binary, e.g. 1 -> "001"
    ...     lstcopy = lst[:]
    ...     for j, c in enumerate(s):
    ...         if c == "1":
    ...             lstcopy[j+1] = lstcopy[j] + lstcopy[j+1]
    ...         else:
    ...             output[-1].append(lstcopy[j])
    ...     output[-1].append(lstcopy[-1])
    ... output
    [['a', 'b', 'c', 'd'],
     ['a', 'b', 'cd'],
     ['a', 'bc', 'd'],
     ['a', 'bcd'],
     ['ab', 'c', 'd'],
     ['ab', 'cd'],
     ['abc', 'd'],
     ['abcd']]
    >>> 
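
The same binary-number idea can also be sketched with integer bit tests instead of a formatted string. This is only an illustrative variant: the generator name segmentations is hypothetical, and it assumes the input is a sequence of strings.

```python
def segmentations(chars):
    """Yield every way of gluing adjacent items of chars into words (hypothetical helper)."""
    chars = list(chars)
    if not chars:
        return  # nothing to segment
    gaps = len(chars) - 1
    for mask in range(1 << gaps):
        words = [chars[0]]
        for j in range(gaps):
            # Read the mask from its most significant gap bit, so gap j
            # corresponds to the j-th digit of the zero-padded binary number.
            glued = (mask >> (gaps - 1 - j)) & 1
            if glued:
                words[-1] += chars[j + 1]
            else:
                words.append(chars[j + 1])
        yield words

for words in segmentations(['a', 'b', 'c', 'd']):
    print(words)
```

Because it is a generator, results are produced one at a time rather than collected in a list first.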
    
  • 2021-01-14 06:37

    You can use a recursive generator:

    def split_combinations(L):
        for split in range(1, len(L)):
            for combination in split_combinations(L[split:]):
                yield [L[:split]] + combination
        yield [L]
    
    print(list(split_combinations('abcd')))
    

    Edit: I'm not sure how well this would scale for long strings, or at what point it would hit Python's recursion limit. As in some of the other answers, you could also use combinations from itertools to work through every possible combination of split points.

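    from itertools import combinations
    
    def split_string(s, t):
        # Pair each split point with the next one; None marks the start and end of s.
        return [s[start:finish] for start, finish in zip((None, ) + t, t + (None, ))]
    
    def split_combinations(s):
        # Try every number of split points, from none up to len(s) - 1.
        for i in range(len(s)):
            for split_points in combinations(range(1, len(s)), i):
                yield split_string(s, split_points)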
    

    These both seem to work as intended in Python 2.7 (see here) and Python 3.2 (here). As @twasbrillig says, make sure you indent it as shown.
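
One way to convince yourself that the two approaches agree is to compare their results on a few short strings. The sketch below restates both under the hypothetical names split_recursive and split_by_points so they can coexist in one file, and assumes non-empty input. On scaling: the nesting depth of the recursive version grows with the length of the string, but the number of results doubles with every extra character, so the sheer size of the output is likely to become a problem well before the recursion limit does.

```python
from itertools import combinations

def split_recursive(s):
    """Recursive generator: choose the first word, then split the rest."""
    for split in range(1, len(s)):
        for rest in split_recursive(s[split:]):
            yield [s[:split]] + rest
    yield [s]

def split_by_points(s):
    """Iterative generator: choose a set of split points, then slice."""
    for i in range(len(s)):
        for points in combinations(range(1, len(s)), i):
            bounds = (0,) + points + (len(s),)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]

def canonical(results):
    """Order-independent form, so the two generators can be compared."""
    return sorted(tuple(words) for words in results)

for text in ("a", "ab", "abcd", "segment"):
    assert canonical(split_recursive(text)) == canonical(split_by_points(text))
    print(text, len(list(split_by_points(text))))
```

The two generators yield the segmentations in a different order, which is why the comparison sorts them first.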
