I am trying to run a simulation to test the average Levenshtein distance between random binary strings. My program is in Python, but I am using this C extension. The fun
What I'd do:
1) Very small optimization: allocate row once and for all, to avoid memory management overhead. Or you may try realloc(), or you could keep track of row's size in a static variable (and have row static as well). This saves very little, however, even if it costs little to put in place.
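A sketch of what option #1 could look like inside the extension (this is my own hypothetical code, not the extension's actual source): one static row buffer, grown with realloc() only when a longer string shows up, computing the distance with the usual single-row dynamic program:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: a single static row, reused across calls and
 * grown with realloc() only when needed (error handling omitted). */
static size_t *row = NULL;
static size_t row_cap = 0;

size_t levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);

    if (lb + 1 > row_cap) {               /* grow only when needed */
        row_cap = lb + 1;
        row = realloc(row, row_cap * sizeof *row);
    }

    for (size_t j = 0; j <= lb; j++)      /* row 0: distance from "" */
        row[j] = j;

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];             /* holds row[i-1][j-1] */
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up  = row[j];          /* row[i-1][j] */
            size_t del = up + 1;
            size_t ins = row[j - 1] + 1;
            size_t sub = diag + (a[i - 1] != b[j - 1]);
            size_t best = del < ins ? del : ins;
            row[j] = sub < best ? sub : best;
            diag = up;
        }
    }
    return row[lb];
}
```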
2) You are trying to calculate an average. Do the average calculation in C as well: you make one call instead of N, which ought to save something in call overhead. Again, a small change, but it comes cheap.
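For option #2, the point is one Python-to-C call per (N, length) pair instead of N round trips. A self-contained sketch of the idea, where average_distance() and rand_bits() are names I invented and distance() is a compact stand-in for the extension's routine:

```c
#include <stdlib.h>
#include <string.h>

/* Compact single-row Levenshtein, standing in for the extension's distance(). */
static size_t distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    for (size_t j = 0; j <= lb; j++) row[j] = j;
    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up   = row[j];
            size_t best = (up < row[j - 1] ? up : row[j - 1]) + 1;
            size_t sub  = diag + (a[i - 1] != b[j - 1]);
            row[j] = sub < best ? sub : best;
            diag = up;
        }
    }
    size_t d = row[lb];
    free(row);
    return d;
}

/* Fill buf with len random '0'/'1' characters. */
static void rand_bits(char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (rand() & 1) ? '1' : '0';
    buf[len] = '\0';
}

/* Run the whole loop in C: n random pairs of length len,
 * returning the average distance per character. */
double average_distance(size_t len, size_t n)
{
    char *s1 = malloc(len + 1), *s2 = malloc(len + 1);
    double total = 0.0;
    for (size_t k = 0; k < n; k++) {
        rand_bits(s1, len);
        rand_bits(s2, len);
        total += (double)distance(s1, s2);
    }
    free(s1);
    free(s2);
    return total / ((double)n * (double)len);
}
```

Python then calls average_distance() once and just prints the result.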
3) Since you're not interested in the individual calculations but only in the result: say you have three PCs, each of them a quad-core machine. Then run four instances of the program on each, with the loop being twelve times shorter. You will get twelve results in one twelfth of the time: average those, and Bob's your uncle.
Option #3 requires no modifications at all except to the loop count, and you may want to make that a command line parameter, so that you can deploy the program on a variable number of computers. Actually, you may want each instance to output both its result and its "weight" (the number of trials it ran), to minimize the chance of errors when you combine the results.
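When you collect the per-instance outputs, the overall figure is the weight-weighted mean of the averages, not their plain mean. A small sketch (combine() is an invented helper; the weights are the trial counts each instance reports):

```c
/* Combine per-instance (average, weight) pairs into one overall average.
 * Multiplying each average by its weight just reconstructs that
 * instance's raw total, so shipping only two numbers loses nothing. */
double combine(const double avg[], const double weight[], int n)
{
    double total = 0.0, w = 0.0;
    for (int i = 0; i < n; i++) {
        total += avg[i] * weight[i];   /* recover each instance's sum */
        w += weight[i];
    }
    return total / w;
}
```

If one machine ran twice as many trials, its result correctly counts twice as much.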
For reference, the measuring loop, with a couple of fixes (a total variable instead of shadowing the built-in sum(), and true division so the average isn't truncated):

    import random
    # distance() is the C extension's function; N and i are set elsewhere.
    total = 0
    L = 2 ** i                          # string length for this round
    for j in range(N):
        str1 = bin(random.getrandbits(L))[2:].zfill(L)
        str2 = bin(random.getrandbits(L))[2:].zfill(L)
        total += distance(str1, str2)
    print(N, i, total / (N * L))        # average distance per character
But if you're interested in a generic Levenshtein statistic, I'm not so sure that doing the calculation with only the symbols 0 and 1 suits your purpose. From the string 01010101 you get 10101010 either by flipping all eight characters or by dropping the first character and appending a zero: two very different costs (8 versus 2). If you have all the letters of the alphabet, the second possibility becomes much less likely, and this ought to change something in the average-cost scenario. Or am I missing something?