Find the words in a long stream of characters. Auto-tokenize

梦谈多话 2021-02-04 09:44

How would you find the correct words in a long stream of characters?

Input:

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
        
5 answers
  •  余生分开走
    2021-02-04 10:23

    Here is some Mathematica code I started developing for a recent code golf challenge.
    It is a minimal recursive matcher with no backtracking: at each position it takes the longest dictionary word it can find and moves on. That means that the sentence "the pen is mighter than the sword" (without spaces) returns {"the pen is might er than the sword"} :)

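    (* Split s into dictionary words. Returns {text, sy}: sy is "=" when every
       character was matched and "~" when some characters had to be skipped. *)
    findAll[s_] :=
      Module[{a = s, b = "", c, j, sy = "="},
      While[
       StringLength[a] != 0,
       j = "";
       (* Move unmatched leading characters into j until a word is found *)
       While[(c = findFirst[a]) == {} && StringLength[a] != 0,
        j = j <> StringTake[a, 1];
        sy = "~";
        a = StringDrop[a, 1];
       ];
       b = b <> " " <> j ;
       If[c != {},
        b = b <> " " <> c[[1]];
        a = StringDrop[a, StringLength[c[[1]]]];
       ];
      ];
       Return[{StringTrim[StringReplace[b, "  " -> " "]], sy}];
    ]
    
    (* Longest prefix of s found in the dictionary, as a list; {} if none.
       c is local here so the lookup does not leak into a global symbol. *)
    findFirst[s_] :=
      Module[{c},
       If[s == "", {},
        c = DictionaryLookup[s];
        If[c == {}, findFirst[StringDrop[s, -1]], c]]];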
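
    For readers without Mathematica, the same left-to-right scan can be sketched in Python. This is only a rough sketch under stated assumptions, not a port of the code above: WORDS is a hypothetical toy word list standing in for DictionaryLookup, and find_first / find_all are illustrative names. Like findFirst, it tries the whole remaining string first and shortens it one character at a time until the word list has a match.

```python
# Hypothetical stand-in for Mathematica's DictionaryLookup: a tiny word set.
WORDS = {"one", "back", "backstop", "stop", "two", "dream", "dreams", "top",
         "month", "pay", "period", "match", "code"}


def find_first(s, words=WORDS):
    """Return the longest prefix of s that is in words, or "" if none is."""
    for end in range(len(s), 0, -1):
        if s[:end] in words:
            return s[:end]
    return ""


def find_all(s, words=WORDS):
    """Split s left to right and return (text, marker).

    marker is "=" when every character fell inside a known word and
    "~" when some characters had to be skipped.
    """
    pieces, marker, i = [], "=", 0
    while i < len(s):
        skipped = ""
        # Collect characters until a known word starts at position i.
        while i < len(s) and not find_first(s[i:], words):
            skipped += s[i]
            marker = "~"
            i += 1
        if skipped:
            pieces.append(skipped)
        word = find_first(s[i:], words)
        if word:
            pieces.append(word)
            i += len(word)
    return " ".join(pieces), marker


for sample in ["twodreamstop", "onebackstop", "month05"]:
    text, marker = find_all(sample)
    print(sample, marker, text)
```

    Because the scan never backtracks, "twodreamstop" comes out as "two dreams top" rather than "two dream stop": once "dreams" has been taken, only "top" is left. A real word list would give different splits from this toy one.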
    

    Sample Input

    ss = {"twodreamstop", 
          "onebackstop", 
          "butterfingers", 
          "dependentrelationship", 
          "payperiodmatchcode", 
          "labordistributioncodedesc", 
          "benefitcalcrulecodedesc", 
          "psaddresstype", 
          "ageconrolnoticeperiod",
          "month05", 
          "as_benefits", 
          "fname"}
    

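    Output ("=" means every character was matched to a dictionary word; "~" means some characters were left unmatched)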

     twodreamstop              = two dreams top
     onebackstop               = one backstop
     butterfingers             = butterfingers
     dependentrelationship     = dependent relationship
     payperiodmatchcode        = pay period match code
     labordistributioncodedesc ~ labor distribution coded es c
     benefitcalcrulecodedesc   ~ benefit c a lc rule coded es c
     psaddresstype             ~ p sad dress type
     ageconrolnoticeperiod     ~ age con rol notice period
     month05                   ~ month 05
     as_benefits               ~ as _ benefits
     fname                     ~ f name
    

    HTH
