Tokenize valid words from a long string

攒了一身酷 2021-02-02 01:47

Suppose you have a dictionary that contains valid words.

Given an input string with all spaces removed, determine whether the string is composed of valid words or not.

2 Answers
  •  时光取名叫无心
    2021-02-02 02:20

    I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result being either true or false depending on whether the entire string can be tokenized correctly.

    Works like this:

    1. If input is the empty string, return true.
    2. Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
    3. If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
    4. Rinse and repeat.
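    The steps above can be sketched in Python; this is my own minimal version, assuming the dictionary is given as a set of strings (the names `can_tokenize` and `dictionary` are mine):

```python
def can_tokenize(s, dictionary):
    """Return True if s can be split entirely into dictionary words."""
    # Step 1: the empty string tokenizes trivially.
    if s == "":
        return True
    # Steps 2-4: try prefixes of increasing length; recurse on the suffix.
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        if prefix in dictionary and can_tokenize(s[end:], dictionary):
            return True
    # No prefix led to a full tokenization: backtrack (return False).
    return False
```

    For example, `can_tokenize("applepie", {"apple", "pie"})` returns true, while `can_tokenize("applepix", {"apple", "pie"})` returns false after backtracking out of the `"apple"` prefix.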

    For dictionaries with a low to moderate number of ambiguous prefixes, this should achieve a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyway, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
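    One such pathological case, as a rough illustration (the instrumented function and its names are mine, not part of the original answer): with the dictionary {"a", "aa"} and an input of n copies of "a" followed by an unmatched "b", every way of splitting the run of a's is explored before the search fails, and the call count grows roughly like the Fibonacci numbers, i.e. exponentially in n.

```python
def count_calls(s, dictionary):
    """Run the recursive tokenizer on s and count how often it is invoked."""
    calls = 0

    def f(rest):
        nonlocal calls
        calls += 1
        if rest == "":
            return True
        for end in range(1, len(rest) + 1):
            if rest[:end] in dictionary and f(rest[end:]):
                return True
        return False

    f(s)
    return calls

# The trailing "b" never matches, so every split of the a-run is tried:
# the call count satisfies C(k) = 1 + C(k-1) + C(k-2), a Fibonacci-like
# recurrence, while the input length only grows linearly.
print(count_calls("a" * 15 + "b", {"a", "aa"}))
```

    Caching results per suffix (memoization) would tame this to polynomial time, but that goes beyond the plain backtracking sketch described here.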

    EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.

    Space complexity would be only stack space, so O(n) even in the worst-case.
