Python: find all possible word combinations with a sequence of characters (word segmentation)

时光毁灭记忆、已成空白 提交于 2020-01-11 06:17:08

问题


I'm doing some word segmentation experiments like the followings.

lst is a sequence of characters, and output is all the possible words.

lst = ['a', 'b', 'c', 'd']

def foo(lst):
    ...
    return output

output = [['a', 'b', 'c', 'd'],
          ['ab', 'c', 'd'],
          ['a', 'bc', 'd'],
          ['a', 'b', 'cd'],
          ['ab', 'cd'],
          ['abc', 'd'],
          ['a', 'bcd'],
          ['abcd']]

I've checked combinations and permutations in itertools library,
and also tried combinatorics.
However, it seems that I'm looking at the wrong side because this is not pure permutation and combinations...

It seems that I can achieve this by using lots of loops, but the efficiency might be low.

EDIT

The word order is important so combinations like ['ba', 'dc'] or ['cd', 'ab'] are not valid.

The order should always be from left to right.

EDIT

@Stuart's solution doesn't work in Python 2.7.6

EDIT

@Stuart's solution does work in Python 2.7.6, see the comments below.


回答1:


itertools.product should indeed be able to help you.

The idea is this:- Consider A1, A2, ..., AN separated by slabs. There will be N-1 slabs. If there is a slab there is a segmentation. If there is no slab, there is a join. Thus, for a given sequence of length N, you should have 2^(N-1) such combinations.

Just like the below

import itertools
lst = ['a', 'b', 'c', 'd']
combinatorics = itertools.product([True, False], repeat=len(lst) - 1)

solution = []
for combination in combinatorics:
    i = 0
    one_such_combination = [lst[i]]
    for slab in combination:
        i += 1
        if not slab: # there is a join
            one_such_combination[-1] += lst[i]
        else:
            one_such_combination += [lst[i]]
    solution.append(one_such_combination)

print solution



回答2:


#!/usr/bin/env python
from itertools import combinations
a = ['a', 'b', 'c', 'd']
a = "".join(a)
cuts = []
for i in range(0,len(a)):
    cuts.extend(combinations(range(1,len(a)),i))
for i in cuts:
    last = 0
    output = []
    for j in i:
        output.append(a[last:j])
        last = j
    output.append(a[last:])
    print(output)

output:

zsh 2419 % ./words.py  
['abcd']
['a', 'bcd']
['ab', 'cd']
['abc', 'd']
['a', 'b', 'cd']
['a', 'bc', 'd']
['ab', 'c', 'd']
['a', 'b', 'c', 'd']



回答3:


There are 8 options, each mirroring the binary numbers 0 through 7:

000
001
010
011
100
101
110
111

Each 0 and 1 represents whether or not the 2 letters at that index are "glued" together. 0 for no, 1 for yes.

>>> lst = ['a', 'b', 'c', 'd']
... output = []
... formatstr = "{{:0{}.0f}}".format(len(lst)-1)
... for i in range(2**(len(lst)-1)):
...     output.append([])
...     s = "{:b}".format(i)
...     s = str(formatstr.format(float(s)))
...     lstcopy = lst[:]
...     for j, c in enumerate(s):
...         if c == "1":
...             lstcopy[j+1] = lstcopy[j] + lstcopy[j+1]
...         else:
...             output[-1].append(lstcopy[j])
...     output[-1].append(lstcopy[-1])
... output
[['a', 'b', 'c', 'd'],
 ['a', 'b', 'cd'],
 ['a', 'bc', 'd'],
 ['a', 'bcd'],
 ['ab', 'c', 'd'],
 ['ab', 'cd'],
 ['abc', 'd'],
 ['abcd']]
>>> 



回答4:


You can use a recursive generator:

def split_combinations(L):
    for split in range(1, len(L)):
        for combination in split_combinations(L[split:]):
            yield [L[:split]] + combination
    yield [L]

print (list(split_combinations('abcd')))

Edit. I'm not sure how well this would scale up for long strings and at what point it hits Python's recursion limit. Similarly to some of the other answers, you could also use combinations from itertools to work through every possible combination of split-points.

def split_string(s, t):
    return [s[start:finish] for start, finish in zip((None, ) + t, t + (None, ))]

def split_combinations(s):
    for i in range(len(s)):
        for split_points in combinations(range(1, len(s)), i):
            yield split_string(s, split_points)

These both seem to work as intended in Python 2.7 (see here) and Python 3.2 (here). As @twasbrillig says, make sure you indent it as shown.



来源:https://stackoverflow.com/questions/27263155/python-find-all-possible-word-combinations-with-a-sequence-of-characters-word

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!