I have a huge set of arbitrary natural language strings. For my tool to analyze them I need to convert each string to unique color value (RGB or other). I need color contras
Sorry, but you can't do what you're looking for with levenshtein distance or similar. RGB and HSV are 3-dimensional geometric spaces, but levenshtein distance describes a metric space - a much looser set of contstraints with no fixed number of dimensions. There's no way to map a metric space into a fixed number of dimensions while always preserving locality.
As far as approximations go, though, for single terms you could use a modification of an algorithm like soundex or metaphone to pick a color; for multiple terms, you could, for example, apply soundex or metaphone to each word individually, then sum them up (with overflow).
First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".
Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.
Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):
You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:
Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}
You can do this:
You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings
"I am going to the store" and
"We are going to the store"
considered similar? What about the strings
"I am going to the store" and
"J bn hpjoh up uif tupsf"
(all of the letters in the original +1), or
"I am going to the store" and
"I am going to the store today"
? Based on what you mean by "similar", you might consider different functions.
If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighing characters by their position in the string.
If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonents, or words to syllables.
There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.
It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:
char1 + char2 + char3 + ... + charN % MAX_COLOUR_VALUE
would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.
You can use something like MinHash or some other LSH method and define similarity as intersection between sets of shingles measured by Jaccard coefficient. There is a good description in Mining of Massive data sets, Ch.3 by Rajaraman and Ullman.