Function that returns affinity between texts?

Submitted by 旧巷老猫 on 2019-11-30 02:23:09

A Dynamic Programming Algorithm

It seems what you are looking for is very similar to what the Smith–Waterman algorithm does.

From Wikipedia:

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).

Let's see a practical example, so you can evaluate its usefulness.

Suppose we have a text:

text = "We the people of the United States, in order to form a more 
perfect union, establish justice, insure domestic tranquility, 
provide for the common defense, 

  promote the general welfare, 

  and secure the blessings of liberty to ourselves and our posterity, 
do ordain and establish this Constitution for the United States of 
America.";  

I isolated the segment we are going to match, just for ease of reading.

We will compare the affinity (or similarity) with a list of strings:

list = {
   "the general welfare",
   "my personal welfare",
   "general utopian welfare",
   "the general",
   "promote welfare",
   "stackoverflow rulez"
   };  

I have the algorithm already implemented, so I'll calculate the similarity and normalize the results:

sw = SmithWatermanSimilarity[text, #] & /@ list;
swN = (sw - Min[sw])/(Max[sw] - Min[sw])  

Then we plot the normalized results.

I think it's very similar to your expected result.
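If you don't have Mathematica at hand, the same local-alignment scoring can be sketched in Python. This is a minimal character-level version with an assumed scoring scheme (+2 match, -1 mismatch, -1 gap), not a drop-in replacement for SmithWatermanSimilarity:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between a and b (O(len(a) * len(b)) time)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def normalize(scores):
    """Min-max normalization, mirroring the swN step above."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```

Running `normalize([smith_waterman(text, s) for s in candidates])` over the candidate strings gives relative affinities in [0, 1], analogous to the swN values.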

HTH!

Some implementations (w/source code)

Take a look at creating N-grams out of your input data and then matching on the N-grams. I have a solution where I regard each n-gram as a dimension in a vector space (a space of 4000 dimensions in my case), and the affinity is then the cosine of the angle between the two vectors (computed via the dot product).
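As a rough sketch of that idea (character trigrams here, though word n-grams work the same way; the choice of n=3 is just an assumption):

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Character n-grams of the text, counted as a sparse vector."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_affinity(a, b, n=3):
    """Cosine of the angle between the two n-gram vectors (1.0 = identical)."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```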

The hard part is to come up with a metric defining the affinity in a way you want.

An alternative is to look at a sliding window and score based on how many words from your compare_x data are in the window. The final score is the sum over all windows.
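A minimal sketch of that sliding-window idea, assuming a word-level window of fixed size:

```python
def window_score(text, query, window=10):
    """Slide a fixed-size window of words over the text; each window scores
    the number of query words it contains, and the final score is the sum."""
    words = text.lower().split()
    query_words = set(query.lower().split())
    total = 0
    for i in range(max(1, len(words) - window + 1)):
        total += sum(1 for w in words[i:i + window] if w in query_words)
    return total
```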

payne

py-editdist will give you the Levenshtein edit distance between two strings, which is one metric that might be helpful.

See: http://www.mindrot.org/projects/py-editdist/

The code example from that page:

import editdist

# Calculate the edit distance between two strings
d = editdist.distance("abc", "bcdef")

Related: https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

I think there is a pretty good and complete answer to this question here: http://answers.google.com/answers/threadview?id=337832

Sorry it's on Google Answers!

Here you can find a list of metrics for calculating the distance between strings, plus an open-source Java library that does just that: http://en.wikipedia.org/wiki/String_metric. In particular, take a look at the Smith–Waterman algorithm, keeping in mind that what they call an "alphabet" can be composed of what we call strings. So, given the alphabet

{A = "hello", B = "hi",C = "goodmorning",D = "evening"}

and called d the distance, your function tries to calculate

d(ABCD,AB) vs d(ABCD,AC)
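One way to realize this in code is to run Levenshtein over sequences of whole words instead of characters, so each "letter" of the alphabet is a word; a sketch:

```python
def token_levenshtein(a_tokens, b_tokens):
    """Levenshtein distance where every edit inserts, deletes, or replaces
    a whole token, so d(ABCD, AB) compares sequences of words."""
    prev = list(range(len(b_tokens) + 1))
    for i in range(1, len(a_tokens) + 1):
        curr = [i] + [0] * len(b_tokens)
        for j in range(1, len(b_tokens) + 1):
            cost = 0 if a_tokens[i - 1] == b_tokens[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a token
                          curr[j - 1] + 1,     # insert a token
                          prev[j - 1] + cost)  # keep or substitute
        prev = curr
    return prev[-1]

A, B, C, D = "hello", "hi", "goodmorning", "evening"
# d(ABCD, AB) == 2 (drop C and D); d(ABCD, AC) == 2 (drop B and D)
```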

Well, you can count the occurrences of pieces of the text you are comparing, e.g.:

"a-b-c" -> "a", "b", "c", "a-b", "b-c", "a-b-c" (and possibly "a-c", if you wanted that)

Then count the occurrences of each of those pieces and sum them, possibly with a weight of (length of piece) / (length of whole string).

Then you just need a way to produce those pieces, and run a check for all of them.
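A possible sketch of that piece-counting scheme, using the weight (length of piece) / (length of whole string) suggested above (note that `str.count` matches raw substrings, so it can also match inside longer words):

```python
def pieces(words):
    """All contiguous runs of words: 'a b c' -> a, b, c, a b, b c, a b c."""
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

def piece_score(text, query):
    """Sum of occurrence counts of every piece of the query in the text,
    each weighted by (length of piece) / (length of whole query)."""
    return sum(text.count(p) * len(p) / len(query) for p in pieces(query.split()))
```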

While the Levenshtein distance as it stands may not suit your purposes, a modification of it might: Try implementing it by storing the insertions, deletions, and substitutions separately.

The distance will then be the sum of the following:

  • All substitutions
  • The number of spaces minus one in each set of consecutive insertions/deletions:
    • (In your case: " hi goodmorning " only counts as two edits, and ' [...] ' counts as one.)

You'd have to test this, of course, but if it doesn't work well, try simply using the sum of consecutive insertions/deletions (so " hi good morning " is only 1 edit).

EDIT

P.S.: this assumes a relatively major change to how Levenshtein works; you'd want to 'align' your data first (finding out where there's significant (more than two characters) overlap and inserting 'null' characters that would count as insertions).

Also, this is just an untested idea, so any ideas for improvements are welcome.
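One possible starting point for the op-counting variant described above: a standard Levenshtein DP plus a backtrace that tallies insertions, deletions, and substitutions separately (merging consecutive runs into single edits would still be layered on top):

```python
def levenshtein_ops(a, b):
    """Standard Levenshtein DP, then a backtrace reporting how many
    insertions, deletions, and substitutions one optimal path uses."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner, counting each operation.
    ins = dels = subs = 0
    i, j = m, n
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            subs += cost          # diagonal move: match (cost 0) or substitution
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1              # horizontal move: insertion
            j -= 1
        else:
            dels += 1             # vertical move: deletion
            i -= 1
    return ins, dels, subs
```

When several optimal paths exist, the individual counts depend on which path the backtrace picks, but their total always equals the edit distance.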
