Deduplicating code in slightly different functions

Question

I have two very similar functions, and both contain an inner loop that closely resembles the loop in a third function. Illustrated with code, it looks close to this:

# First function
def fmeasure_kfold1(array, nfolds):
    ret = []

    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])

        for build in array[test_index]:  # <- All functions have this loop

            # Retrieved tests is calculated inside the build loop in kfold1
            retrieved_tests = get_tests(set(build['modules']), correlation)

            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)

    return ret

# Second function
def fmeasure_kfold2(array, nfolds):
    ret = []

    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])

        # Retrieved tests is calculated outside the build loop in kfold2
        retrieved_tests = _sum_tests(correlation)

        for build in array[test_index]:  # <- All functions have this loop

            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)

    return ret

# Third function
def fmeasure_all(array):
    ret = []
    for build in array:  # <- All functions have this loop

        relevant = set(build['tests'])
        fval = calc_f2(relevant)  # <- Instead of calc_f, I call calc_f2
        if fval is not None:
            ret.append(fval)

    return ret

The first two functions differ only in how and when they calculate retrieved_tests. The third function differs from the inner loop of the first two in that it calls calc_f2 and doesn't use retrieved_tests at all.

In reality the code is more complex, but while the duplication irked me I figured I could live with it. However, lately I've been making changes to it, and it's annoying to have to change it in two or three places at once.

Is there a good way to merge the duplicated code? The only way I could think of involved introducing classes, which introduces a lot of boilerplate, and I would like to keep the functions as pure functions if possible.

Edit

This is the contents of calc_f and calc_f2:

def calc_f(relevant, retrieved):
    """Calculate the F-measure given relevant and retrieved tests."""
    recall = len(relevant & retrieved)/len(relevant)
    prec = len(relevant & retrieved)/len(retrieved)
    fmeasure = f_measure(recall, prec)

    return (fmeasure, recall, prec)


def calc_f2(relevant, nbr_tests=1000):
    """Calculate the F-measure given relevant tests."""
    recall = 1
    prec = len(relevant) / nbr_tests
    fmeasure = f_measure(recall, prec)

    return (fmeasure, recall, prec)

f_measure calculates the harmonic mean of precision and recall.
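The body of f_measure isn't shown here. A minimal sketch, assuming the usual F1 definition; the zero guard is an assumption for illustration and not part of the original code:

```python
def f_measure(recall, prec):
    """Harmonic mean of recall and precision."""
    if recall + prec == 0:
        # Assumption: return 0.0 for the degenerate case instead of dividing by zero.
        return 0.0
    return 2 * recall * prec / (recall + prec)
```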

Basically, calc_f2 takes a lot of shortcuts since no retrieved tests are needed.

Answer 1:

Having a common function that takes an extra parameter that controls where to compute retrieved_tests would work too.

e.g.

def fmeasure_kfold_generic(array, nfolds, mode):
    ret = []

    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])

        # mode 2: retrieved tests are calculated once per fold, outside the build loop
        if mode == 2:
            retrieved_tests = _sum_tests(correlation)

        for build in array[test_index]:  # <- All functions have this loop
            # mode 1: retrieved tests are calculated per build, inside the build loop
            if mode == 1:
                retrieved_tests = get_tests(set(build['modules']), correlation)

            relevant_tests = set(build['tests'])
            fval = calc_f(relevant_tests, retrieved_tests)
            if fval is not None:
                ret.append(fval)

    return ret

Answer 2:

One way is to write each inner loop as its own function, and then write the outer loop as a separate function that receives the inner one as an argument. This is close to how sorting functions work: they receive the function that should be used to compare two elements.

Of course, the hard part is to find what exactly is the common part between all functions, which is not always simple.
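A minimal sketch of that idea, with toy data. The names collect_fmeasures, prepare, score, seen_tests and overlap are hypothetical stand-ins, not the asker's functions, and the folds are passed in as plain (train, test) pairs rather than produced by KFold:

```python
def collect_fmeasures(folds, prepare, score):
    """Shared outer loop: prepare runs once per fold, score once per build."""
    ret = []
    for train, test in folds:
        context = prepare(train)
        for build in test:
            fval = score(build, context)
            if fval is not None:
                ret.append(fval)
    return ret


# Toy stand-ins for the real per-fold and per-build work.
def seen_tests(train):
    return {t for b in train for t in b["tests"]}


def overlap(build, seen):
    relevant = set(build["tests"])
    if not relevant:
        return None  # skipped by the outer loop
    return len(relevant & seen) / len(relevant)


builds = [{"tests": ["a", "b"]}, {"tests": ["b"]}, {"tests": []}]
folds = [(builds[:2], builds[2:]), (builds[1:], builds[:1])]

result = collect_fmeasures(folds, seen_tests, overlap)
```

The outer loop never needs to know what prepare or score actually do, which is the same contract a sort has with its comparison function.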

Answer 3:

The typical solution would be to identify the stages of the algorithm and use the Template Method design pattern, with the differing stages implemented in subclasses. I do not understand your code at all, but I assume there would be methods like makeGlobalRetrievedTests() and makeIndividualRetrievedTests()?
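A rough sketch of what that could look like in Python. The class names, the toy data and the rule inside PerBuild are all made up for illustration; only the two hook names come from the guess above, and the scoring step is simplified to a set intersection:

```python
class KFoldMeasure:
    """Template method: run() fixes the loop structure, subclasses fill in the hooks."""

    def run(self, folds):
        ret = []
        for train, test in folds:
            shared = self.make_global_retrieved_tests(train)
            for build in test:
                retrieved = self.make_individual_retrieved_tests(build, shared)
                ret.append(sorted(set(build["tests"]) & retrieved))
        return ret

    def make_global_retrieved_tests(self, train):
        # Default hook: nothing to precompute per fold.
        return None

    def make_individual_retrieved_tests(self, build, shared):
        raise NotImplementedError


class PerFold(KFoldMeasure):
    """kfold2 style: one retrieved set per fold, reused for every build."""

    def make_global_retrieved_tests(self, train):
        return {t for b in train for t in b["tests"]}

    def make_individual_retrieved_tests(self, build, shared):
        return shared


class PerBuild(KFoldMeasure):
    """kfold1 style: a fresh retrieved set for every build."""

    def make_individual_retrieved_tests(self, build, shared):
        # Toy rule: each module "retrieves" a test named after it.
        return {m + "_test" for m in build["modules"]}


builds = [
    {"modules": ["io"], "tests": ["io_test", "net_test"]},
    {"modules": ["net"], "tests": ["net_test", "db_test"]},
]
folds = [([builds[0]], [builds[1]]), ([builds[1]], [builds[0]])]

per_fold = PerFold().run(folds)
per_build = PerBuild().run(folds)
```

This is the class-based boilerplate the question hoped to avoid, so it is mostly useful when the stages carry enough state to justify an object.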

Answer 4:

I'd approach the problem inside-out, by factoring out the innermost loop first. This fits a functional style well. It seems to me that if you generalize fmeasure_all a bit, you could implement all three functions in terms of it. Something like:

def fmeasure(builds, calcFn, retrieveFn):
    ret = []
    for build in builds:
        relevant = set(build['tests'])
        fval = calcFn(relevant, retrieveFn(build))
        if fval is not None:
            ret.append(fval)

    return ret

This allows you to define:

def fmeasure_kfold1(array, nfolds):
    ret = []

    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])

        ret += fmeasure(array[test_index], calc_f,
                        lambda build: get_tests(set(build['modules']), correlation))

    return ret


def fmeasure_kfold2(array, nfolds):
    ret = []

    # Kfold1 and kfold2 both have this outer loop
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])

        # Retrieved tests is calculated outside the build loop in kfold2
        retrieved_tests = _sum_tests(correlation)

        ret += fmeasure(array[test_index], calc_f, lambda _: retrieved_tests)

    return ret


def fmeasure_all(array):
    return fmeasure(array,
                    lambda relevant, _: calc_f2(relevant),
                    lambda x: x)

By now, fmeasure_kfold1 and fmeasure_kfold2 look awfully similar. They mostly differ in how fmeasure is called, so we can implement a generic fmeasure_kfoldn function that centralizes the iteration and the collection of results:

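def fmeasure_kfoldn(array, nfolds, measure_fn):
    # measure_fn(builds, correlation) returns the list of F-measures for one fold
    # (named measure_fn to avoid shadowing the built-in callable)
    ret = []
    for train_index, test_index in KFold(len(array), nfolds):
        correlation = analyze(array[train_index])
        ret += measure_fn(array[test_index], correlation)
    return ret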

This allows defining fmeasure_kfold1 and fmeasure_kfold2 very easily:

def fmeasure_kfold1(array, nfolds):
    def measure(builds, correlation):
        return fmeasure(builds, calc_f, lambda build: get_tests(set(build['modules']), correlation))
    return fmeasure_kfoldn(array, nfolds, measure)


def fmeasure_kfold2(array, nfolds):
    def measure(builds, correlation):
        retrieved_tests = _sum_tests(correlation)
        return fmeasure(builds, calc_f, lambda _: retrieved_tests)
    return fmeasure_kfoldn(array, nfolds, measure)
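To sanity-check the shape of this refactoring, here is a self-contained run on toy data. Everything marked as a stand-in is hypothetical: the real analyze, get_tests, _sum_tests and calc_f are not shown in full in the question, and the folds are passed in as explicit index lists instead of coming from KFold:

```python
def fmeasure(builds, calc_fn, retrieve_fn):
    ret = []
    for build in builds:
        relevant = set(build["tests"])
        fval = calc_fn(relevant, retrieve_fn(build))
        if fval is not None:
            ret.append(fval)
    return ret


def fmeasure_kfoldn(builds, folds, measure):
    ret = []
    for train_index, test_index in folds:
        correlation = analyze([builds[i] for i in train_index])
        ret += measure([builds[i] for i in test_index], correlation)
    return ret


# --- Stand-ins (assumptions, not the asker's implementations) ---
def analyze(train):
    """Map each module to the set of tests seen alongside it."""
    correlation = {}
    for build in train:
        for module in build["modules"]:
            correlation.setdefault(module, set()).update(build["tests"])
    return correlation


def get_tests(modules, correlation):
    return set().union(*(correlation.get(m, set()) for m in modules))


def _sum_tests(correlation):
    return set().union(*correlation.values())


def calc_f(relevant, retrieved):
    if not relevant or not retrieved:
        return None
    return len(relevant & retrieved) / len(relevant)  # recall only, for brevity


def fmeasure_kfold1(builds, folds):
    def measure(test_builds, correlation):
        return fmeasure(test_builds, calc_f,
                        lambda build: get_tests(set(build["modules"]), correlation))
    return fmeasure_kfoldn(builds, folds, measure)


def fmeasure_kfold2(builds, folds):
    def measure(test_builds, correlation):
        retrieved_tests = _sum_tests(correlation)
        return fmeasure(test_builds, calc_f, lambda _: retrieved_tests)
    return fmeasure_kfoldn(builds, folds, measure)


builds = [
    {"modules": ["io"], "tests": ["t1", "t2"]},
    {"modules": ["net"], "tests": ["t2", "t3"]},
    {"modules": ["io"], "tests": ["t1"]},
]
folds = [([0, 1], [2]), ([1, 2], [0])]

print(fmeasure_kfold1(builds, folds))  # [1.0, 0.5]
print(fmeasure_kfold2(builds, folds))  # [1.0, 1.0]
```

The two variants now differ only in their small measure closures, so a change to the fold loop or to the result collection happens in exactly one place.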

Source: https://stackoverflow.com/questions/28562765/deduplicating-code-in-slightly-different-functions

Tags

python

code-duplication

code-design