Yesterday I was pairing the socks from the clean laundry and realized that the way I was doing it is not very efficient. I was doing a naive search: picking one sock and scanning the pile to find its pair.
Socks, whether real ones or some analogous data structure, would be supplied in pairs.
The simplest answer is that before the pair is ever allowed to be separated, a single data structure for the pair should be initialized, containing a pointer to the left sock and a pointer to the right sock. Socks can then be referred to directly or via their pair. Alternatively, each sock may be extended to contain a pointer to its partner.
This solves any computational pairing problem by removing it behind a layer of abstraction.
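A minimal sketch of that abstraction (class and field names are illustrative, not from any particular library): the pair owns pointers to both socks, and each sock keeps a back-reference to its partner, so either sock can be reached from the other in O(1).

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Sock:
        color: str
        partner: Optional["Sock"] = None   # back-pointer to the other sock

    @dataclass
    class Pair:
        left: Sock
        right: Sock

    def make_pair(left: Sock, right: Sock) -> Pair:
        # Link the socks before they can ever be separated, so the
        # pairing problem never has to be solved after the fact.
        left.partner = right
        right.partner = left
        return Pair(left, right)

    a, b = Sock("black"), Sock("black")
    pair = make_pair(a, b)
    assert a.partner is b and pair.left is a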
Applying the same idea to the practical problem of pairing socks, the apparent answer is: don't allow your socks to ever be unpaired. Socks are provided as a pair, put in the drawer as a pair (perhaps by balling them together), worn as a pair. But the point where unpairing is possible is in the washer, so all that's required is a physical mechanism that allows the socks to stay together and be washed efficiently.
There are two physical possibilities:
For a 'pair' object that keeps a pointer to each sock, we could have a cloth bag that we use to keep the socks together. This seems like massive overhead, though.
But for each sock to keep a reference to the other, there is a neat solution: a popper (or a 'snap button' if you're American), such as these:
http://www.aliexpress.com/compare/compare-invisible-snap-buttons.html
Then all you do is snap your socks together right after you take them off and put them in your washing basket, and again you've removed the problem of needing to pair your socks with a physical abstraction of the 'pair' concept.
Here's an Omega(n log n) lower bound in the comparison-based model. (The only valid operation is comparing two socks.)
Suppose that you know that your 2n socks are arranged this way:
p1 p2 p3 ... pn pf(1) pf(2) ... pf(n)
where f is an unknown permutation of the set {1, 2, ..., n}. Knowing this cannot make the problem harder. There are n! possible outputs (matchings between the first and second half), which means you need log(n!) = Omega(n log n) comparisons. This bound is achievable by sorting.
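Spelled out, this is the standard decision-tree counting argument; the last inequality holds because the largest n/2 terms of the sum are each at least log2(n/2):

    \#\mathrm{leaves} \;\ge\; n!
    \quad\Longrightarrow\quad
    \mathrm{depth} \;\ge\; \log_2(n!) \;=\; \sum_{k=1}^{n} \log_2 k
    \;\ge\; \frac{n}{2}\,\log_2\!\frac{n}{2} \;=\; \Omega(n \log n)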
Since you are interested in connections to the element distinctness problem: proving the Omega(n log n) bound for element distinctness is harder, because the output is just a yes/no bit. Here, the output has to be a matching, and the number of possible outputs suffices to get a decent bound. However, there is a variant connected to element distinctness: suppose you are given 2n socks and wonder whether they can be uniquely paired. You can get a reduction from ED by sending (a1, a2, ..., an) to (a1, a1, a2, a2, ..., an, an). (Parenthetically, the proof of hardness of ED is very interesting, via topology.)
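A sketch of that reduction, with illustrative names (the point is only the mapping, not any pairing algorithm):

    def ed_to_unique_pairing(values):
        # Element distinctness instance (a1, ..., an) becomes the sock
        # instance (a1, a1, a2, a2, ..., an, an).
        socks = []
        for a in values:
            socks.extend([a, a])
        return socks

    # (1, 2, 3) -> [1, 1, 2, 2, 3, 3]: exactly one way to pair them up.
    # (1, 2, 2) -> [1, 1, 2, 2, 2, 2]: the four 2s can be paired in three
    # ways, so the pairing is not unique; the answer to "uniquely
    # pairable?" is exactly the answer to element distinctness.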
I think that there should be an Omega(n^2) bound for the original problem if you allow equality tests only. My intuition: consider a graph where you add an edge for each test performed, and argue that if the graph is not dense, the output is not uniquely determined.
Sorting solutions have been proposed, but sorting is a little too much: We don't need order; we just need equality groups.
So hashing would be enough (and faster).
This kind of recursive hash partitioning is actually what SQL Server does when it needs to hash join or hash aggregate over huge data sets. It distributes its build input stream into many independent partitions. This scheme scales linearly to arbitrary amounts of data and across multiple CPUs.
You don't need recursive partitioning if you can find a distribution key (hash key) that provides enough buckets that each bucket is small enough to be processed very quickly. Unfortunately, I don't think socks have such a property.
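On a computer, though, a hash of all attributes is cheap, and the hashing idea becomes a single pass over a hash table: expected O(N), no ordering needed. A minimal sketch (the key function and attribute names are illustrative assumptions):

    from collections import defaultdict

    def pair_socks_by_hash(socks, key):
        # Unmatched socks wait in a hash table under the hash of their
        # attributes; when a sock's key is already present, emit a pair.
        waiting = defaultdict(list)
        pairs = []
        for sock in socks:
            k = key(sock)
            if waiting[k]:
                pairs.append((waiting[k].pop(), sock))
            else:
                waiting[k].append(sock)
        return pairs

    # e.g. pair_socks_by_hash(basket, key=lambda s: (s.color, s.length, s.pattern))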
If each sock had an integer called "PairID", one could easily distribute them into 10 buckets according to PairID % 10 (the last digit).
The best real-world partitioning I can think of is creating a rectangle of piles: one dimension is color, the other is the pattern. Why a rectangle? Because we need O(1) random-access to piles. (A 3D cuboid would also work, but that is not very practical.)
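As a sketch, the rectangle maps naturally onto a dictionary keyed by (color, pattern), which provides the O(1) access to piles; each small pile is then matched by brute force, and an oversized pile could be re-split on a further attribute, as in the recursive scheme above (attribute names are illustrative):

    from collections import defaultdict

    def pair_socks_by_piles(socks, same_pair):
        # One pile per (color, pattern) cell of the rectangle.
        piles = defaultdict(list)
        for sock in socks:
            piles[(sock.color, sock.pattern)].append(sock)

        pairs, odd_ones = [], []
        for pile in piles.values():
            # Piles are small, so a brute-force scan inside each is cheap.
            while pile:
                sock = pile.pop()
                i = next((j for j, other in enumerate(pile)
                          if same_pair(sock, other)), None)
                if i is None:
                    odd_ones.append(sock)            # odd sock out
                else:
                    pairs.append((sock, pile.pop(i)))
        return pairs, odd_ones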
Update:
What about parallelism? Can multiple humans match the socks faster?
What about the element distinctness problem? As the article states, the element distinctness problem can be solved in O(N). This is the same for the socks problem: it is also O(N), provided you need only one distribution step (I proposed multiple steps only because humans are bad at calculations; one step is enough if you distribute on md5(color, length, pattern, ...), i.e. a perfect hash of all attributes). Clearly, one cannot go faster than O(N), so we have reached the optimal lower bound.

Although the outputs are not exactly the same (in one case just a boolean, in the other the pairs of socks), the asymptotic complexities are the same.
What I do is that I pick up the first sock and put it down (say, on the edge of the laundry bowl). Then I pick up another sock and check to see if it's the same as the first sock. If it is, I remove them both. If it's not, I put it down next to the first sock. Then I pick up the third sock and compare that to the first two (if they're still there). Etc.
This approach can fairly easily be implemented in an array, assuming that "removing" socks is an option. Actually, you don't even need to "remove" socks: if you don't need the socks sorted (see below), you can just move them around and end up with an array that has all the socks arranged in pairs.
Assuming that the only operation for socks is to compare for equality, this algorithm is basically still an n^2 algorithm, though I don't know about the average case (I never learned to calculate that).
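A sketch of this put-down-and-compare approach over a plain list, assuming the only available operation on socks is an equality test (same_pair is an illustrative stand-in for that test):

    def pair_socks_naive(socks, same_pair):
        # Unmatched socks go "on the edge of the bowl"; each new sock is
        # compared against everything laid out so far. Worst case O(n^2).
        laid_out = []
        pairs = []
        for sock in socks:
            for i, candidate in enumerate(laid_out):
                if same_pair(sock, candidate):
                    pairs.append((candidate, sock))
                    del laid_out[i]          # "remove" both socks
                    break
            else:
                laid_out.append(sock)        # no match yet: put it down
        return pairs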
Sorting, of course, improves efficiency, especially in real life where you can easily "insert" a sock between two other socks. In computing the same could be achieved with a tree, but that costs extra space. And, of course, we are back at N log N (or a bit more, if there are several socks that are the same by the sorting criteria but not from the same pair).
Other than that, I cannot think of anything, but this method does seem to be pretty efficient in real life. :)
If the "move" operation is fairly expensive, and the "compare" operation is cheap, and you need to move the whole set anyway, into a buffer where search is much faster than in original storage... just integrate sorting into the obligatory move.
I found that integrating the sorting into hanging the socks to dry makes it a breeze. I need to pick up each sock anyway and hang it (move), and it costs me next to nothing to hang it in a specific place on the strings. To avoid searching the whole buffer (the strings), I place socks by color/shade: darker to the left, brighter to the right, more colorful to the front, and so on. Before I hang each sock, I look in its vicinity to see whether a matching one is already there (this limits the "scan" to 2-3 other socks), and if it is, I hang the new one right next to it. Then, when they are dry, I roll them into pairs as I take them off the strings.
This may not seem all that different from the "form piles by color" approach suggested in the top answers, but by using ranges rather than discrete piles, I have no problem deciding whether "purple" goes in the "red" or the "blue" pile; it just goes in between. And by integrating the two operations (hanging to dry and sorting), the overhead of sorting while hanging is maybe 10% of what a separate sort would be.
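In code this is roughly insertion into a line kept ordered by a coarse color key, checking only a small neighbourhood for the partner; color_key, same_pair and the vicinity size are illustrative assumptions, not anything from the answer above:

    import bisect

    def hang_and_pair(socks, color_key, same_pair, vicinity=3):
        # The "strings": socks hung in order of a coarse color key.
        keys, hung = [], []                  # parallel lists, sorted by key
        pairs = []
        for sock in socks:
            k = color_key(sock)
            pos = bisect.bisect_left(keys, k)
            # Look only in the sock's vicinity for its partner.
            lo, hi = max(0, pos - vicinity), min(len(hung), pos + vicinity)
            for i in range(lo, hi):
                if same_pair(hung[i], sock):
                    pairs.append((hung.pop(i), sock))   # pair immediately
                    keys.pop(i)
                    break
            else:
                keys.insert(pos, k)
                hung.insert(pos, sock)
        return pairs, hung                   # hung: odd socks left on the line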
I have taken simple steps to reduce my effort to a process taking O(1) time.
By reducing my inputs to one of two types of socks (white socks for recreation, black socks for work), I only need to determine which of two socks I have in hand. (Technically, since they are never washed together, I have reduced the process to O(0) time.)
Some upfront effort is required to find desirable socks and to purchase them in sufficient quantity to eliminate the need for your existing socks. As I had done this before I needed black socks, my effort was minimal, but your mileage may vary.
Such an upfront effort has been seen many times in very popular and effective code. Examples include #define'ing pi to several decimal places (other examples exist, but that's the one that comes to mind right now).