What techniques/tools are there for discovering common phrases in chunks of text?

我寻月下人不归 2021-01-02 17:28

Let's say I have 100,000 email bodies, and 2,000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet".

3 Answers
  • 2021-01-02 18:06

    Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function for computing the Longest Common Subsequence):

    someMinimumLengthParameter = 20;
    foundPhrases = [];
    
    do {
        lcs = LCS(mailbodies);
    
        // Remove the phrase whether or not it is ignored,
        // so the loop always makes progress (a `continue` before
        // removal would recompute the same lcs forever).
        for body in mailbodies {
            body.remove(lcs);
        }
    
        if (lcs in ignoredPhrases) continue;
    
        foundPhrases += lcs;
    } while (lcs.length > someMinimumLengthParameter);
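    A minimal runnable sketch of this idea in Python. As a cheap stand-in for a true multi-document LCS, it uses `difflib.SequenceMatcher.find_longest_match` (longest common substring of a pair) against the first body; the function names and the `min_len` default are my own, not from the answer.

    ```python
    from difflib import SequenceMatcher

    def longest_common_substring(a, b):
        # find_longest_match returns the longest block shared by a and b;
        # autojunk=False disables the heuristic that skips frequent characters.
        sm = SequenceMatcher(None, a, b, autojunk=False)
        m = sm.find_longest_match(0, len(a), 0, len(b))
        return a[m.a:m.a + m.size]

    def find_common_phrases(bodies, min_len=20, ignored=()):
        # Repeatedly peel off the longest shared phrase until it gets too short.
        bodies = list(bodies)
        found = []
        while True:
            lcs = max(
                (longest_common_substring(bodies[0], b) for b in bodies[1:]),
                key=len,
                default="",
            )
            if len(lcs) <= min_len:
                break
            if lcs not in ignored:
                found.append(lcs)
            # Remove the phrase either way, so the loop always makes progress.
            bodies = [b.replace(lcs, "") for b in bodies]
        return found
    ```

    Note that comparing only against the first body is a simplification; a phrase absent from that body would be missed.
    
    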
    
  • 2021-01-02 18:12

    I'm not sure if this is what you want, but check out the longest common substring problem and diff-utility algorithms.
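    For reference, the textbook dynamic-programming solution to the longest common substring problem tracks, for each pair of prefixes, the length of their longest common suffix; a minimal sketch (function name is my own):

    ```python
    def longest_common_substring_dp(a, b):
        # prev[j] holds the length of the longest common suffix of
        # a[:i-1] and b[:j]; a match extends the diagonal entry by one.
        best, best_end = 0, 0
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best:
                        best, best_end = cur[j], j
            prev = cur
        return b[best_end - best:best_end]
    ```

    This runs in O(len(a) * len(b)) time; suffix-tree or suffix-automaton approaches bring it down to linear time if the pairwise cost matters at scale.
    
    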

  • 2021-01-02 18:22

    Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.
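    A minimal sketch of that idea, indexing word trigrams together with their positions so they can later be extended (function name is my own):

    ```python
    from collections import defaultdict

    def trigram_positions(texts):
        # Map each word trigram to every (doc_index, word_index) where it
        # occurs; frequent trigrams are candidates for longer common phrases.
        index = defaultdict(list)
        for doc_id, text in enumerate(texts):
            words = text.split()
            for i in range(len(words) - 2):
                index[tuple(words[i:i + 3])].append((doc_id, i))
        return index
    ```

    The most frequent trigram is then `max(index.items(), key=lambda kv: len(kv[1]))`, and because positions are stored rather than just counts, you can check whether the word before or after the trigram is also shared and grow the phrase outward.
    
    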
