Looking for fast sorted integer array intersection/union algorithms implemented in C [closed]

Question

I am looking for C algorithms (or code) that implement fast sorted integer array intersection/union operations. The faster, the better.

In other words, what's an efficient way in C to implement union and intersection operations between two arrays of integers?

Answer 1:

This is not quite the fastest, but it demonstrates the right algorithm.

// List is an output variable; it could be a linked list, an array list,
// or really anything you can append to.

void intersection(int A[], int B[], int aLen, int bLen, List output)
{
    int i = 0;
    int j = 0;
    while (i < aLen && j < bLen)
    {
        int a = A[i];
        int b = B[j];
        if (a == b)
        {
            add(output, a);
            i++;
            j++;
        }
        else if (a < b)
        {
            i++;
        }
        else
        {
            j++;
        }
    }
}

The above algorithm is O(aLen + bLen).

You can do better, especially when it comes to the problem of intersecting more than 2 lists.

For intersection the basic algorithm is to iterate through all sorted lists to be intersected at the same time. If the heads of all the lists match, move to the next element in all lists and add the head to the intersection. If not, find the maximum element visible, and attempt to find that element in all other lists.

In my code example, I just keep iterating, but since these are sorted lists, if you expect A to be the numbers 1 through 10000 and B to be the set {7777}, you can also binary search to the correct index. Finding the maximum element with multiple lists means using a heap if you want to do it properly.

If you make the binary search change, your worst case goes up to O((aLen + bLen) * lg(aLen + bLen)), but depending on the data, your average case might improve drastically.
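As a rough illustration of the binary search idea, here is a minimal sketch for the lopsided case (a short list against a long one). The names intersect_bsearch, lower_bound, shortArr and longArr are hypothetical, and it writes to a plain int array instead of the List type used above. It only searches the long array, so it costs about nShort * lg(nLong) comparisons, which is one variant of the change described above rather than the only way to do it.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element in a[lo..n) that is >= key,
   or n if there is no such element. */
static size_t lower_bound(const int *a, size_t lo, size_t n, int key)
{
    size_t hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersects a short sorted array with a long sorted array by binary
   searching the long one for each element of the short one.
   out must have room for nShort ints. Returns the number written. */
size_t intersect_bsearch(const int *shortArr, size_t nShort,
                         const int *longArr, size_t nLong, int *out)
{
    size_t pos = 0;    /* everything before pos in longArr is ruled out */
    size_t count = 0;

    assert(out != NULL || nShort == 0);
    for (size_t i = 0; i < nShort && pos < nLong; i++) {
        /* Search only the part of longArr we have not passed yet. */
        pos = lower_bound(longArr, pos, nLong, shortArr[i]);
        if (pos < nLong && longArr[pos] == shortArr[i]) {
            out[count++] = shortArr[i];
            pos++;
        }
    }
    return count;
}
```

With A = 1..10000 and B = {7777}, calling this with B as the short array does a single binary search instead of walking 7777 elements.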

The heap change becomes necessary when intersecting many sets together, as the above algorithm becomes O(numLists * (total number of elements in all lists)), which can be reduced to O(lg(numLists) * (total number of elements in all lists)).
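To make the multi-list procedure more concrete, here is a sketch of the basic version (no heap): it carries the largest head seen so far as a candidate and cycles through the lists, skipping smaller elements, until either every list agrees on the candidate or one list runs out. The name intersect_many and its parameters are hypothetical, the inputs are assumed to be sorted and free of duplicates, and the skipping is linear for simplicity (a binary search could be substituted).

```c
#include <assert.h>
#include <stddef.h>

/* Intersects k sorted, duplicate-free arrays.
   lists[i] has lens[i] elements, pos is caller-provided scratch space
   of k entries, and out needs room for the shortest list's length.
   Returns the number of elements written to out. */
size_t intersect_many(const int *const *lists, const size_t *lens,
                      size_t k, size_t *pos, int *out)
{
    size_t count = 0;

    assert(k > 0);
    for (size_t n = 0; n < k; n++) {
        if (lens[n] == 0)
            return 0;              /* an empty list empties the result */
        pos[n] = 0;
    }

    int candidate = lists[0][0];
    size_t agree = 0;              /* lists known to contain candidate */
    size_t i = 0;

    for (;;) {
        /* Skip everything in list i that is smaller than the candidate. */
        while (pos[i] < lens[i] && lists[i][pos[i]] < candidate)
            pos[i]++;
        if (pos[i] == lens[i])
            return count;          /* one list ran out: nothing more can match */

        if (lists[i][pos[i]] == candidate) {
            if (++agree == k) {    /* every list has the candidate */
                out[count++] = candidate;
                if (++pos[i] == lens[i])
                    return count;
                candidate = lists[i][pos[i]];
                agree = 0;
                continue;          /* recheck list i against the new candidate */
            }
        } else {
            /* A larger head becomes the new maximum to look for. */
            candidate = lists[i][pos[i]];
            agree = 1;
        }
        i = (i + 1) % k;
    }
}
```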

// Named unionSorted because union is a reserved word in C.
void unionSorted(int A[], int B[], int aLen, int bLen, List output)
{
    int i = 0;
    int j = 0;
    while (i < aLen && j < bLen)
    {
        int a = A[i];
        int b = B[j];
        if (a == b)
        {
            add(output, a);
            i++;
            j++;
        }
        else if (a < b)
        {
            add(output, a);
            i++;
        }
        else
        {
            add(output, b);
            j++;
        }
    }
    // Add any leftovers.
    for (; i < aLen; i++)
    {
        add(output, A[i]);
    }
    for (; j < bLen; j++)
    {
        add(output, B[j]);
    }
}

Union is basically the same algorithm, except that you always add every element, so there is no point in binary searching. Extending it to multiple lists can be done with a heap that implements peek: the basic rule is to always add the smallest element, then step forward in every list that had that element at the front.
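The heap-based multi-list union could look roughly like the sketch below. The Cursor struct and the names union_many, sift_down and head are hypothetical. Instead of stepping every matching list forward at once, it pops one head at a time and skips any value equal to the last one written, which has the same effect and also tolerates repeats inside a single list.

```c
#include <assert.h>
#include <stddef.h>

/* One heap entry: a sorted list and how far into it we are. */
typedef struct {
    const int *data;
    size_t len;
    size_t pos;
} Cursor;

static int head(const Cursor *c) { return c->data[c->pos]; }

/* Restores the min-heap property (ordered by head value) below index i. */
static void sift_down(Cursor *heap, size_t n, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = l + 1;
        if (l < n && head(&heap[l]) < head(&heap[smallest])) smallest = l;
        if (r < n && head(&heap[r]) < head(&heap[smallest])) smallest = r;
        if (smallest == i) return;
        Cursor tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Merges k sorted lists into out, writing each distinct value once.
   heap[] holds one Cursor per list (data and len filled in by the
   caller) and is reordered in place. out needs room for the sum of
   all lengths. Returns the number of elements written. */
size_t union_many(Cursor *heap, size_t k, int *out)
{
    size_t n = 0, count = 0;

    assert(out != NULL);
    /* Keep only the non-empty lists, then build the heap. */
    for (size_t i = 0; i < k; i++) {
        if (heap[i].len > 0) {
            heap[n] = heap[i];
            heap[n].pos = 0;
            n++;
        }
    }
    for (size_t i = n / 2; i-- > 0; )
        sift_down(heap, n, i);

    while (n > 0) {
        int v = head(&heap[0]);            /* peek at the smallest head */
        if (count == 0 || out[count - 1] != v)
            out[count++] = v;              /* skip values already written */
        if (++heap[0].pos == heap[0].len)
            heap[0] = heap[--n];           /* list exhausted: drop it */
        sift_down(heap, n, 0);
    }
    return count;
}
```

Each element passes through the heap once, so the work is about lg(numLists) per element, in line with the bound mentioned earlier.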

Answer 2:

Assuming that these are actual sets (since an intersection on arrays with duplicates is problematic at best), the following code may help.
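If the inputs might contain repeats, one option (an assumption on my part, not something the code below requires you to do) is to collapse each sorted array into a proper set first. A minimal sketch, with the hypothetical name dedup_sorted:

```c
#include <assert.h>
#include <stddef.h>

/* Removes adjacent duplicates from a sorted array in place and
   returns the new length. */
size_t dedup_sorted(int *arr, size_t n)
{
    if (n == 0)
        return 0;
    assert(arr != NULL);

    size_t w = 1;                      /* next slot to write */
    for (size_t r = 1; r < n; r++) {
        if (arr[r] != arr[w - 1])      /* new value: keep it */
            arr[w++] = arr[r];
    }
    return w;
}
```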

First, some requisite headers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

and some mandatory documentation:

// Description:
//     Will give back the union or intersection of two sorted integer
//         sets (only one copy of each element).
//     Caller responsible for freeing memory.
// Input:
//     Union ('u') or intersection (anything else).
//     arr1 and arr2 are pointers to the arrays.
//     sz1 and sz2 are the number of integers in each array.
//     psz3 as the pointer to the variable to receive the
//         size of the returned array.
// Returns:
//     A pointer to the array, or NULL if malloc failed.
//     Memory allocated even if result set is empty, so
//        NULL return indicates ONLY malloc failure.
//     psz3 receives the size of that array, which is
//        zero for an empty set.

Then the function proper:

int *arrUnionIntersect ( int type,
    int *arr1, size_t sz1,
    int *arr2, size_t sz2,
    size_t *psz3
) {
    int *arr3, *ptr1, *ptr2;

    *psz3 = 0;

    // If both inputs are empty, just return dummy block.

    if ((sz1 == 0) && (sz2 == 0))
        return malloc (1);

    // Otherwise allocate space for real result.

    if (type == 'u')
        arr3 = malloc ((sz1 + sz2) * sizeof (*arr1));
    else
        if (sz1 > sz2)
            arr3 = malloc (sz1 * sizeof (*arr1));
        else
            arr3 = malloc (sz2 * sizeof (*arr1));
    if (arr3 == NULL)
        return NULL;

Up until there, it's mostly initialisation for the function. This following bit traverses the two input sets, selecting what gets added to the result. This is best done (in my opinion) as three phases, the first being selection of an element when both input sets still have some remaining. Note the differing behaviour here for unions and intersections, specifically intersections only have the element added if it's in both input sets:

    // Set up pointers for input processing.

    ptr1 = arr1;
    ptr2 = arr2;

    // Phase A: select appropriate number from either, when both
    //    have remaining elements.

    while ((sz1 > 0) && (sz2 > 0)) {
        if (*ptr1 == *ptr2) {
            arr3[(*psz3)++] = *ptr1++;
            sz1--;
            ptr2++;
            sz2--;
            continue;
        }

        // We don't copy for intersect where elements are different.

        if (*ptr1 < *ptr2) {
            if (type == 'u')
                arr3[(*psz3)++] = *ptr1;
            ptr1++;
            sz1--;
            continue;
        }

        if (type == 'u')
            arr3[(*psz3)++] = *ptr2;
        ptr2++;
        sz2--;
    }

The other two phases (of which at most one will do any work for unions, and neither for intersections) simply copy the remaining items from the non-empty input set:

    // Phase B and C are only for unions.

    if (type == 'u') {
        // Phase B: process rest of arr1 if arr2 ran out.

        while (sz1 > 0) {
            arr3[(*psz3)++] = *ptr1++;
            sz1--;
        }

        // Phase C: process rest of arr2 if arr1 ran out.

        while (sz2 > 0) {
            arr3[(*psz3)++] = *ptr2++;
            sz2--;
        }
    }

    // Return the result (union or intersection).

    return arr3;
}

And a test program:

int main (void) {
    int x1[] = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19};
    int x2[] = {2, 3, 5, 7, 11, 13, 17, 19};
    size_t i, sz3;
    int *x3;

    x3 = arrUnionIntersect ('u', x1, sizeof(x1)/sizeof(*x1),
        x2, sizeof(x2)/sizeof(*x2), &sz3);
    printf ("union =");
    for (i = 0; i < sz3; i++)
        printf (" %d", x3[i]);
    free (x3);
    printf ("\n");

    x3 = arrUnionIntersect ('i', x1, sizeof(x1)/sizeof(*x1),
        x2, sizeof(x2)/sizeof(*x2), &sz3);
    printf ("intersection =");
    for (i = 0; i < sz3; i++)
        printf (" %d", x3[i]);
    free (x3);
    printf ("\n");

    return 0;
}

along with its output, as expected:

union = 1 2 3 5 7 9 11 13 15 17 19
intersection = 3 5 7 11 13 17 19

Source: https://stackoverflow.com/questions/8890154/looking-for-fast-sorted-integer-array-intersection-union-algorithms-implemented

Tags

algorithm

intersection