How to determine a strings dna for likeness to another

后端 未结 6 2032
爱一瞬间的悲伤
爱一瞬间的悲伤 2020-12-28 20:44

I am hoping I am wording this correctly to get across what I am looking for.

I need to compare two pieces of text. If the two strings are alike I would like to get s

相关标签:
6条回答
  • 2020-12-28 20:47

    Check their Levenshtein Distance

    In PHP you even have the levenshtein() function that makes exactly that.

    0 讨论(0)
  • 2020-12-28 20:48

    I need to compare two pieces of text. If the two strings are alike I would like to get scores that are very alike if the strings are very different i need scores that are very different.

    It really depends on what you mean by "same" or "different". For example, if someone replaces "United States of America" with "USA" in your string, is that mostly the same string (because USA is just an abbreviation for something longer), or is it very different (because a lot of characters changed)?

    You essentially need to either devise a function that describes how to compute "sameness" or use a pre-existing definition thereof. For example, the aforementioned Levenshtein distance measures total difference based on the number of changes you have to make to get to the original string.

    0 讨论(0)
  • 2020-12-28 20:56

    What you're looking for is a LCS algorithm (see also Levenshtein distance). You may also try Soundex or some other phonetic algorithm.

    0 讨论(0)
  • 2020-12-28 21:03

    Since the Levenshtein distance needs both input strings to produce a value, you would have to store all strings.

    You could, however, use a small number of strings as markers and only store these as strings.

    You would then calculate the Levenshtein distance from a new string to each of these marker strings and store these values. You could then guess that two strings that have a similar Levenshtein distance to all markers are also similar to each other. It would likely be sensible to "engineer" these markers in such a way that their mutual Levenshtein distance is as large as possible. I don't know whether there has been some research in this direction.

    0 讨论(0)
  • 2020-12-28 21:03

    Many people have suggested looking at distance/metric like approaches, and I think the wording of the question leads that way. (By the way, a hash like md5 is trying to do pretty much the opposite thing that a metric does, so it's hardly surprising that this wouldn't work for you. There are similar ideas that don't change much under small deltas, but I suspect they don't encode enough information for what you want to do)

    Particularly given your update in the comments though, I think this type of approach is not very helpful.

    What you are looking for is more of a clustering problem, where you want to generate a signature (i.e. feature vector) from each email and later compare it to new inputs. So essentially what you have is a machine learning problem. Deciding what "close" means may be a bit of a challenge. To get started though, assuming it actually is emails you're looking at you may do well to look at the sorts of feature generation done by many spam-filters, this will give you (probably Euclidean, at least to start) a space to measure distances in based on a signature (feature vector).

    Without knowing more about your problem it's hard to be more specific.

    0 讨论(0)
  • 2020-12-28 21:10

    Reading your comments, it sounds like you are actually trying to compare entire documents, each containing many words.

    This is done successfully in information retrieval systems by treating documents as N-dimensional points in space. Each word in the language is an axis. The distance along the axis is determined by the number of times that word appears in the document. Similar documents are then "near" each other in space.

    This way, the whole document doesn't need to be stored, just its word counts. And usually the most common words in the language are not counted at all.

    0 讨论(0)
提交回复
热议问题