What techniques/tools are there for discovering common phrases in chunks of text?

我寻月下人不归 2021-01-02 17:28

Let's say I have 100,000 email bodies, and 2,000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet".

3 Answers
  • 2021-01-02 18:06

    Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function for computing the Longest Common Subsequence):

    someMinimumLengthParameter = 20;
    foundPhrases = [];
    
    do {
        lcs = LCS(mailbodies);
    
        // Remove the phrase whether or not it is ignored,
        // so the loop always makes progress (a `continue` before
        // removal would recompute the same lcs forever).
        for body in mailbodies {
            body.remove(lcs);
        }
    
        if (lcs in ignoredPhrases) continue;
    
        foundPhrases += lcs;
    } while (lcs.length > someMinimumLengthParameter);
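    A minimal runnable sketch of this idea in Python. As a cheap stand-in for a true multi-document LCS, it uses `difflib.SequenceMatcher.find_longest_match` (longest common substring of a pair) against the first body; the function names and the `min_len` default are my own, not from the answer.

    ```python
    from difflib import SequenceMatcher

    def longest_common_substring(a, b):
        # find_longest_match returns the longest block shared by a and b;
        # autojunk=False disables the heuristic that skips frequent characters.
        sm = SequenceMatcher(None, a, b, autojunk=False)
        m = sm.find_longest_match(0, len(a), 0, len(b))
        return a[m.a:m.a + m.size]

    def find_common_phrases(bodies, min_len=20, ignored=()):
        # Repeatedly peel off the longest shared phrase until it gets too short.
        bodies = list(bodies)
        found = []
        while True:
            lcs = max(
                (longest_common_substring(bodies[0], b) for b in bodies[1:]),
                key=len,
                default="",
            )
            if len(lcs) <= min_len:
                break
            if lcs not in ignored:
                found.append(lcs)
            # Remove the phrase either way, so the loop always makes progress.
            bodies = [b.replace(lcs, "") for b in bodies]
        return found
    ```

    Note that comparing only against the first body is a simplification; a phrase absent from that body would be missed.
    
    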
    
  • 2021-01-02 18:12

    I'm not sure if this is what you want, but check out the longest common substring problem and diff-utility algorithms.
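    For reference, the textbook dynamic-programming solution to the longest common substring problem tracks, for each pair of prefixes, the length of their longest common suffix; a minimal sketch (function name is my own):

    ```python
    def longest_common_substring_dp(a, b):
        # prev[j] holds the length of the longest common suffix of
        # a[:i-1] and b[:j]; a match extends the diagonal entry by one.
        best, best_end = 0, 0
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best:
                        best, best_end = cur[j], j
            prev = cur
        return b[best_end - best:best_end]
    ```

    This runs in O(len(a) * len(b)) time; suffix-tree or suffix-automaton approaches bring it down to linear time if the pairwise cost matters at scale.
    
    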

  • 2021-01-02 18:22

    Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.
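    A minimal sketch of that idea, indexing word trigrams together with their positions so they can later be extended (function name is my own):

    ```python
    from collections import defaultdict

    def trigram_positions(texts):
        # Map each word trigram to every (doc_index, word_index) where it
        # occurs; frequent trigrams are candidates for longer common phrases.
        index = defaultdict(list)
        for doc_id, text in enumerate(texts):
            words = text.split()
            for i in range(len(words) - 2):
                index[tuple(words[i:i + 3])].append((doc_id, i))
        return index
    ```

    The most frequent trigram is then `max(index.items(), key=lambda kv: len(kv[1]))`, and because positions are stored rather than just counts, you can check whether the word before or after the trigram is also shared and grow the phrase outward.
    
    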
