Anyone know an example algorithm for word segmentation using dynamic programming? [closed]

喜夏-厌秋 submitted on 2019-12-21 21:34:24

Question


If you search Google for word segmentation, there are really no good descriptions of it. I'm trying to fully understand the process a dynamic programming algorithm follows to find a segmentation of a string into individual words. Does anyone know where to find a good description of the word segmentation problem, or can anyone describe it?

Word segmentation, in case you didn't know, is basically taking a string of characters and deciding where to split it into words; a dynamic programming solution builds the answer from the solutions to smaller subproblems. This is pretty simple to do using recursion, but I haven't been able to find even a description of an iterative algorithm for it anywhere online, so if anyone has any examples or can give an algorithm, that would be great.

Thanks for any help.


Answer 1:


I'm going to assume that we're not talking about the trivial case here (i.e. not just splitting a string around spaces, since that would just be a basic tokenizer problem). Instead, we're talking about a case where there isn't a clear word delimiter character, so we have to "guess" the best mapping from string to words, for instance a set of concatenated words without spaces, such as transforming this:

lotsofwordstogether

into this:

lots, of, words, together

In this case, the dynamic programming approach would probably be to build a table where one dimension corresponds to the Mth word in the sequence and the other corresponds to the Nth character of the input string. The value you fill in for each cell of the table is "the best match score we can get if we end (or, alternatively, begin) the Mth word at position N".
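The table described above can be sketched iteratively as follows. This is only an illustration under stated assumptions: `WORD_SCORES` is a hypothetical toy dictionary with made-up log-probabilities, and `segment_table` is a name chosen here, not something from the answer. Each cell `best[m][j]` holds the best score for splitting the first `j` characters into exactly `m` words, with the Mth word ending at position `j`.

```python
import math

# Hypothetical toy dictionary: word -> log10 probability (made-up values).
WORD_SCORES = {
    "lots": -3.0, "of": -1.5, "words": -3.0, "together": -3.5,
    "lot": -3.2, "so": -2.0, "word": -2.8, "to": -1.4, "get": -2.5, "her": -2.6,
}
MAX_WORD_LEN = max(len(w) for w in WORD_SCORES)


def segment_table(text):
    """Return the best-scoring split of `text`, or [] if none exists."""
    n = len(text)
    neg_inf = -math.inf
    # best[m][j]: best total score for splitting text[:j] into exactly m words.
    best = [[neg_inf] * (n + 1) for _ in range(n + 1)]
    # back[m][j]: start index of the m-th word in that best split.
    back = [[None] * (n + 1) for _ in range(n + 1)]
    best[0][0] = 0.0  # zero words cover zero characters

    for m in range(1, n + 1):            # number of words used so far
        for j in range(m, n + 1):        # the m-th word ends at position j
            for i in range(max(m - 1, j - MAX_WORD_LEN), j):
                word = text[i:j]
                if word in WORD_SCORES and best[m - 1][i] > neg_inf:
                    candidate = best[m - 1][i] + WORD_SCORES[word]
                    if candidate > best[m][j]:
                        best[m][j] = candidate
                        back[m][j] = i

    # Pick the word count whose full-length cell scores highest.
    m = max(range(n + 1), key=lambda k: best[k][n])
    if best[m][n] == neg_inf:
        return []

    # Walk the back-pointers from the end of the string to the start.
    words, j = [], n
    while m > 0:
        i = back[m][j]
        words.append(text[i:j])
        j, m = i, m - 1
    return words[::-1]


print(segment_table("lotsofwordstogether"))
# ['lots', 'of', 'words', 'together']
```

If the number of words does not matter, the first dimension can be dropped and a one-dimensional table indexed only by the end position is enough; the two-dimensional form is kept here because it is what the answer describes.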




Answer 2:


The Python wordsegment module has such an algorithm. It uses recursion and memoization to implement dynamic programming.

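The source is available on GitHub; here is the relevant snippet:

from math import log10

# divide(), score() and clean() are helpers defined elsewhere in the module.
def segment(text):
    "Return a list of words that is the best segmentation of `text`."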

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

Note how memo caches the results of calls to search, which in turn selects the max from candidates, so each (suffix, previous word) pair is only solved once.
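For readers who want something they can run without the module, here is a minimal self-contained sketch of the same recursion-plus-memoization idea. It is a simplification, not the module's code: it scores each word on its own (unigram) rather than conditioning on the previous word, `functools.lru_cache` stands in for the `memo` dictionary, and `COUNTS`, `score` and `divide` are hypothetical stand-ins with made-up numbers.

```python
from functools import lru_cache
from math import log10

# Hypothetical unigram counts (made-up numbers, not the module's data).
COUNTS = {"lots": 50, "of": 900, "words": 60, "together": 40,
          "to": 800, "get": 100, "her": 120, "lot": 45, "so": 300}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def score(word):
    """Log10 probability of `word`; unknown words get a length-based penalty."""
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(1.0 / (TOTAL * 10 ** len(word)))


def divide(text):
    """Yield every (prefix, suffix) split with a non-empty, bounded prefix."""
    for pos in range(1, min(len(text), MAX_WORD_LEN) + 1):
        yield text[:pos], text[pos:]


@lru_cache(maxsize=None)
def search(text):
    """Return (best_score, words) for `text`; lru_cache acts as the memo."""
    if not text:
        return 0.0, ()
    return max(
        (score(prefix) + search(suffix)[0], (prefix,) + search(suffix)[1])
        for prefix, suffix in divide(text)
    )


def segment(text):
    """Return the best segmentation of `text` as a list of words."""
    return list(search(text)[1])


print(segment("lotsofwordstogether"))
# ['lots', 'of', 'words', 'together']
```

Because the recursion depth grows with the length of the input, a version like this is only suitable for short strings; an iterative table avoids that limit.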




Answer 3:


Here is a solution in iterative style. The main idea is to break the problem into finding the segmentations that have exactly 1, 2, 3, ..., n words within a certain range of the input. Excuse me if there are any minor indexing mistakes, my head is very fuzzy these days, but this is an iterative solution for you:

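// Assumes: import java.util.*; and a dictContains(String) lookup defined elsewhere.
static List<String> connectIter(List<String> tokens) {

    // Used instead of a 2-D array; the key has the form "wordCount startIndex".
    Map<String, List<String>> data = new HashMap<String, List<String>>();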

    int n = tokens.size();

    for(int i = 0; i < n; i++) {
        String str = concat(tokens, i, n);
        if (dictContains(str)) {
            data.put(1 + " " + i, Collections.singletonList(str));
        }
    }

    // Build solutions with exactly i words from those with i - 1 words.
    for (int i = 2; i <= n; i++) {
        for (int start = 0; start < n; start++) {
            List<String> solutions = new ArrayList<String>();
            for (int end = start + 1; end <= n - i + 1; end++) {
                String firstPart = concat(tokens, start, end);

                if (dictContains(firstPart)) {
                    List<String> partialSolutions = data.get((i - 1) + " " + end);
                    if (partialSolutions != null) {
                        List<String> newSolutions = new ArrayList<>();
                        for (String onePartialSolution : partialSolutions) {
                            newSolutions.add(firstPart + " "
                                    + onePartialSolution);
                        }
                        solutions.addAll(newSolutions);
                    }
                }
            }

            if (solutions.size() != 0) {
                data.put(i + " " + start, solutions);
            }
        }
    }

    List<String> ret = new ArrayList<String>();
    for(int i = 1; i <= n; i++) { // can be optimized to run with less iterations
        List<String> solutions = data.get(i + " " + 0);
        if (solutions != null) {
            ret.addAll(solutions);
        }
    }

    return ret;
}


static private String concat(List<String> tokens, int low, int hi) {
    StringBuilder sb = new StringBuilder();
    for(int i = low; i < hi; i++) {
        sb.append(tokens.get(i));
    }
    return sb.toString();
}
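The same idea can be sketched compactly in Python, which makes it easy to try on a small input. This is a rough port under assumptions rather than the answerer's code: it splits a plain string by character instead of a list of tokens, and `all_segmentations` and the `dictionary` set are hypothetical names introduced for the example.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into dictionary words."""
    n = len(text)
    # data[(k, start)]: every split of text[start:] into exactly k words.
    data = {}

    # Base case: the whole suffix text[start:] is a single dictionary word.
    for start in range(n):
        if text[start:] in dictionary:
            data[(1, start)] = [text[start:]]

    for k in range(2, n + 1):
        for start in range(n):
            solutions = []
            # Leave at least k - 1 characters for the remaining k - 1 words.
            for end in range(start + 1, n - k + 2):
                first = text[start:end]
                if first in dictionary:
                    for rest in data.get((k - 1, end), []):
                        solutions.append(first + " " + rest)
            if solutions:
                data[(k, start)] = solutions

    # Collect the splits of the full string for every possible word count.
    result = []
    for k in range(1, n + 1):
        result.extend(data.get((k, 0), []))
    return result


print(all_segmentations("together", {"to", "get", "her", "together"}))
# ['together', 'to get her']
```

Note that this enumerates all valid segmentations rather than picking a best one, so the output can grow exponentially on inputs with many overlapping dictionary words.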


Source: https://stackoverflow.com/questions/1781647/anyone-know-an-example-algorithm-for-word-segmentation-using-dynamic-programming
