Find the words in a long stream of characters. Auto-tokenize

前端 未结 5 2040
梦谈多话
梦谈多话 2021-02-04 09:44

How would you find the correct words in a long stream of characters?

Input :

\"The revised report onthesyntactictheoriesofsequentialcontrolandstate\"
         


        
5条回答
  •  走了就别回头了
    2021-02-04 10:22

    I would try a recursive algorithm like this:

    • Try inserting a space at each position. If the left part is a word, then recur on the right part.
    • Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.

    For example, giving it "thesentenceisgood" would run:

    thesentenceisgood
    the sentenceisgood
        sent enceisgood
             enceisgood: OUT1: the sent enceisgood, 2/3
        sentence isgood
                 is good
                    go od: OUT2: the sentence is go od, 4/5
                 is good: OUT3: the sentence is good, 4/4
        sentenceisgood: OUT4: the sentenceisgood, 1/2
    these ntenceisgood
          ntenceisgood: OUT5: these ntenceisgood, 1/2
    

    So you would pick OUT3 as the answer.

提交回复
热议问题