Question
I have two very similar loops, and both contain an inner loop that is very similar to a third loop (eh... :) ). Illustrated with code, it looks roughly like this:
# First function
def fmeasure_kfold1(array, nfolds):
    ret = []
    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        for build in array[test_index]:  # <- All functions have this loop
            # Retrieved tests is calculated inside the build loop in kfold1
            retrieved_tests = get_tests(set(build['modules']), correlation)
            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)
    return ret
# Second function
def fmeasure_kfold2(array, nfolds):
    ret = []
    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        # Retrieved tests is calculated outside the build loop in kfold2
        retrieved_tests = _sum_tests(correlation)
        for build in array[test_index]:  # <- All functions have this loop
            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)
    return ret
# Third function
def fmeasure_all(array):
    ret = []
    for build in array:  # <- All functions have this loop
        relevant = set(build['tests'])
        fval = calc_f2(relevant)  # <- Instead of calc_f, I call calc_f2
        if fval is not None:
            ret.append(fval)
    return ret
The first two functions differ only in how, and when, they calculate retrieved_tests. The third function differs from the inner loop of the first two in that it calls calc_f2 and doesn't make use of retrieved_tests.
In reality the code is more complex, but while the duplication irked me I figured I could live with it. However, lately I've been making changes to it, and it's annoying to have to change it in two or three places at once.
Is there a good way to merge the duplicated code? The only way I could think of involved introducing classes, which introduces a lot of boilerplate, and I would like to keep the functions as pure functions if possible.
Edit
These are the contents of calc_f and calc_f2:
def calc_f(relevant, retrieved):
    """Calculate the F-measure given relevant and retrieved tests."""
    recall = len(relevant & retrieved)/len(relevant)
    prec = len(relevant & retrieved)/len(retrieved)
    fmeasure = f_measure(recall, prec)
    return (fmeasure, recall, prec)

def calc_f2(relevant, nbr_tests=1000):
    """Calculate the F-measure given relevant tests."""
    recall = 1
    prec = len(relevant) / nbr_tests
    fmeasure = f_measure(recall, prec)
    return (fmeasure, recall, prec)
f_measure calculates the harmonic mean of precision and recall.
Basically, calc_f2 takes a lot of shortcuts since no retrieved tests are needed.
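For reference, f_measure itself is not shown here. A minimal sketch of what it might look like, assuming it is the plain harmonic mean of its two arguments (the handling of the all-zero case is an assumption, not taken from the original code):

```python
def f_measure(recall, prec):
    """Harmonic mean of recall and precision (assumed definition)."""
    if recall + prec == 0:
        # Assumption: return 0.0 when both inputs are zero, because the
        # harmonic mean is undefined there.
        return 0.0
    return 2 * recall * prec / (recall + prec)
```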
Answer 1:
Having a common function that takes an extra parameter controlling where retrieved_tests is computed would work too, e.g.:
def fmeasure_kfold_generic(array, nfolds, mode):
    ret = []
    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        # Retrieved tests is calculated outside the build loop in kfold2
        if mode == 2:
            retrieved_tests = _sum_tests(correlation)
        for build in array[test_index]:  # <- All functions have this loop
            # Retrieved tests is calculated inside the build loop in kfold1
            if mode == 1:
                retrieved_tests = get_tests(set(build['modules']), correlation)
            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)
    return ret
Answer 2:
One way is to write the inner loops each as a function, and then have the outer loop as a separate function that receives the others as an argument. This is something close to what is done in sorting functions (that receive the function that should be used to compare two elements).
Of course, the hard part is to find what exactly is the common part between all functions, which is not always simple.
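A rough sketch of that idea, with the inner loop and the outer loop each written once and the varying part passed in as a function. The names collect, run_folds and make_measure, and the toy data, are hypothetical and only stand in for the real KFold/analyze/calc_f machinery:

```python
def collect(items, measure):
    """Shared inner loop: apply `measure` to each item, skipping None results."""
    ret = []
    for item in items:
        value = measure(item)
        if value is not None:
            ret.append(value)
    return ret


def run_folds(folds, make_measure):
    """Shared outer loop: build a per-fold measure, then run the inner loop."""
    ret = []
    for train, test in folds:
        ret += collect(test, make_measure(train))
    return ret


# Toy stand-in for the real work: score each test item by its distance
# above the training mean, and return None for items below the mean.
def make_measure(train):
    mean = sum(train) / len(train)
    return lambda item: item - mean if item >= mean else None


folds = [([1, 2], [3, 4]), ([3, 4], [1, 2])]
result = run_folds(folds, make_measure)
```

Whether the expensive step runs once per fold or once per item then depends only on whether it sits in make_measure itself or inside the function it returns.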
Answer 3:
The typical solution would be to identify the parts of the algorithm and use the Template Method design pattern, where the different stages are implemented in subclasses. I do not understand your code at all, but I assume there would be methods like makeGlobalRetrievedTests() and makeIndividualRetrievedTests()?
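A minimal sketch of how the Template Method pattern could look here. The class and method names are hypothetical, the folds are passed in as ready-made (train, test) pairs, and a plain overlap count stands in for calc_f:

```python
class FoldMeasure:
    """Template method: `run` fixes the loop structure, subclasses fill in hooks."""

    def run(self, folds):
        ret = []
        for train, test in folds:
            context = self.prepare_fold(train)  # hook: called once per fold
            for build in test:
                value = self.measure_build(build, context)  # hook: once per build
                if value is not None:
                    ret.append(value)
        return ret

    def prepare_fold(self, train):
        # Default: hand the training builds through unchanged.
        return train

    def measure_build(self, build, context):
        raise NotImplementedError


class GlobalRetrieved(FoldMeasure):
    """Computes the retrieved set once per fold (the fmeasure_kfold2 shape)."""

    def prepare_fold(self, train):
        return set().union(*(b['tests'] for b in train))

    def measure_build(self, build, retrieved):
        return len(set(build['tests']) & retrieved)


class IndividualRetrieved(FoldMeasure):
    """Computes the retrieved set per build (the fmeasure_kfold1 shape)."""

    def measure_build(self, build, train):
        modules = set(build['modules'])
        retrieved = {t for b in train if modules & set(b['modules'])
                     for t in b['tests']}
        return len(set(build['tests']) & retrieved)


train = [{'modules': ['a'], 'tests': ['t1', 't2']},
         {'modules': ['b'], 'tests': ['t3']}]
test = [{'modules': ['a'], 'tests': ['t1', 't3']}]
global_scores = GlobalRetrieved().run([(train, test)])
individual_scores = IndividualRetrieved().run([(train, test)])
```

The loop lives only in FoldMeasure.run; the cost is the class boilerplate that the question hoped to avoid.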
Answer 4:
I'd approach the problem inside-out: by factoring out the innermost loop. This works well with a 'functional' style (as well as 'functional programming'). It seems to me that if you generalize fmeasure_all a bit, you could implement all three functions in terms of it. Something like:
def fmeasure(builds, calcFn, retrieveFn):
    ret = []
    # Iterate over the `builds` parameter
    for build in builds:
        relevant = set(build['tests'])
        fval = calcFn(relevant, retrieveFn(build))
        if fval is not None:
            ret.append(fval)
    return ret
This allows you to define:
def fmeasure_kfold1(array, nfolds):
    ret = []
    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        ret += fmeasure(array[test_index], calc_f,
                        lambda build: get_tests(set(build['modules']), correlation))
    return ret

def fmeasure_kfold2(array, nfolds):
    ret = []
    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        # Retrieved tests is calculated outside the build loop in kfold2
        retrieved_tests = _sum_tests(correlation)
        ret += fmeasure(array[test_index], calc_f, lambda _: retrieved_tests)
    return ret

def fmeasure_all(array):
    return fmeasure(array,
                    lambda relevant, _: calc_f2(relevant),
                    lambda x: x)
By now, fmeasure_kfold1 and fmeasure_kfold2 look awfully similar. They mostly differ in how fmeasure is called, so we can implement a generic fmeasure_kfoldn function that centralizes the iteration and the collection of results:
def fmeasure_kfoldn(array, nfolds, measure):
    # `measure` is named so as not to shadow the built-in callable()
    ret = []
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        ret += measure(array[test_index], correlation)
    return ret
This allows defining fmeasure_kfold1 and fmeasure_kfold2 very easily:
def fmeasure_kfold1(array, nfolds):
    def measure(builds, correlation):
        return fmeasure(builds, calc_f, lambda build: get_tests(set(build['modules']), correlation))
    return fmeasure_kfoldn(array, nfolds, measure)

def fmeasure_kfold2(array, nfolds):
    def measure(builds, correlation):
        retrieved_tests = _sum_tests(correlation)
        return fmeasure(builds, calc_f, lambda _: retrieved_tests)
    return fmeasure_kfoldn(array, nfolds, measure)
Source: https://stackoverflow.com/questions/28562765/deduplicating-code-in-slightly-different-functions