What techniques/tools are there for discovering common phrases in chunks of text?

我寻月下人不归 2021-01-02 17:28

Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet"

3 answers
  •  醉梦人生
    2021-01-02 18:22

    Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.
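
The trigram idea above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the function names (`ngram_positions`, `extend_right`, `common_phrases`), the whitespace tokenizer, and the `min_docs` threshold are all assumptions made for the example. It stores the position of every word trigram, keeps the ones that occur in at least `min_docs` texts, and then grows each one to the right for as long as enough texts agree on the next word.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase + whitespace split is good enough for a sketch.
    return text.lower().split()

def ngram_positions(docs, n=3):
    """Map each word n-gram to the (doc_index, word_offset) pairs where it occurs."""
    positions = defaultdict(list)
    for d, words in enumerate(docs):
        for i in range(len(words) - n + 1):
            positions[tuple(words[i:i + n])].append((d, i))
    return positions

def doc_frequency(occurrences):
    # Number of distinct documents among the occurrences.
    return len({d for d, _ in occurrences})

def extend_right(docs, gram, occurrences, min_docs):
    """Grow a phrase to the right while at least min_docs documents share the next word."""
    phrase = list(gram)
    while True:
        followers = defaultdict(list)
        for d, i in occurrences:
            nxt = i + len(phrase)
            if nxt < len(docs[d]):
                followers[docs[d][nxt]].append((d, i))
        if not followers:
            return phrase
        word, occ = max(followers.items(), key=lambda kv: doc_frequency(kv[1]))
        if doc_frequency(occ) < min_docs:
            return phrase
        phrase.append(word)
        occurrences = occ

def common_phrases(texts, n=3, min_docs=2):
    docs = [tokenize(t) for t in texts]
    phrases = set()
    for gram, occ in ngram_positions(docs, n).items():
        if doc_frequency(occ) >= min_docs:
            phrases.add(" ".join(extend_right(docs, gram, occ, min_docs)))
    # Keep only maximal phrases: drop any phrase contained in a longer one.
    return sorted(
        p for p in phrases
        if not any(p != q and f" {p} " in f" {q} " for q in phrases)
    )

# Hypothetical sample data standing in for the email bodies.
emails = [
    "Hello team, the quick brown fox jumps over the lazy dog today",
    "Reminder: the quick brown fox jumps over the lazy dog was here",
    "Nothing in common with the other two messages",
]
result = common_phrases(emails)
print(result)
```

Because every occurrence of every trigram is stored, memory grows with the total number of words, which is the practical limit on N and on corpus size that the answer mentions. For 100000 emails it would probably make sense to count first and only record positions for trigrams that pass the threshold.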
