I was playing with permutations
in a couple of programs and stumbled upon this little experiment:
Permutation method 1:
permute([], []).
permute([X|Rest], L) :-
    permute(Rest, L1),
    select(X, L, L1).
I suspect what triggered this investigation was the discussion about tail-recursive sum/2 using an accumulator versus not. The sum/2 example is very cut-and-dried: one version does the arithmetic on the stack, the other uses an accumulator. However, like most things in the real world, the general truth is "it depends." For instance, compare the efficiency of methods 1 and 2 using full instantiation:
Method 1:
?- time(permute([1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9])).
% 18 inferences, 0.000 CPU in 0.000 seconds (66% CPU, 857143 Lips)
true ;
% 86,546 inferences, 0.022 CPU in 0.022 seconds (100% CPU, 3974193 Lips)
false.
Method 2:
?- time(permute([1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9])).
% 18 inferences, 0.000 CPU in 0.000 seconds (62% CPU, 857143 Lips)
true ;
% 47 inferences, 0.000 CPU in 0.000 seconds (79% CPU, 940000 Lips)
false.
Method 1 beats method 2 when you're generating solutions (as in your tests), but method 2 beats method 1 when you're simply checking. Looking at the code it's easy to see why: to check, the first one has to re-permute the whole tail of the list, while the second one just has to try selecting out one item. In this case it may be easy to point to the generating case and say it's the one that matters more, but that determination is simply one of the tradeoffs one must keep track of when dealing with Prolog. It's very difficult to make predicates that are all things to all people and always perform great; you must decide which are the "privileged paths" and which are not.
I do vaguely recall someone recently showed an example of appending lists "during the return" and how you could take something that isn't or shouldn't be tail recursive and make it work thanks to unification, but I don't have the link handy. Hopefully whoever brought it up last time (Will?) will show up and share it.
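For what it's worth, the standard illustration of that idea is the append/3 pattern; the following is only a minimal sketch of what I think was meant, not the actual example from that discussion (app/3 is just a local name to avoid clashing with the builtin):
% The output list is built in the clause head *before* the recursive call:
% unification with the still-unbound tail Zs lets the result be assembled
% "on the way down", so the recursive call is the last call of the clause.
app([], Ys, Ys).
app([X|Xs], Ys, [X|Zs]) :-
    app(Xs, Ys, Zs).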
Great question, by the way. Your investigation method is valid; you'll just need to take other instantiation patterns into account as well. Speaking personally, I usually worry more about correctness and generality than about performance up front. If I see immediately how to use an accumulator instead, I will, but otherwise I won't do it that way until I run into an actual need for better performance. Tail recursion is just one method for improving performance; frequently there are other issues that need to be addressed just as urgently or more so.
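To make the sum/2 contrast mentioned at the top concrete, here is a minimal sketch; the names sum_stack/2 and sum_acc/2 are mine, not from the original discussion:
% Does the addition while returning from the recursion, so it needs stack
% space proportional to the length of the list:
sum_stack([], 0).
sum_stack([X|Xs], Sum) :-
    sum_stack(Xs, Sum0),
    Sum is Sum0 + X.

% Tail-recursive version that threads an accumulator instead:
sum_acc(Xs, Sum) :-
    sum_acc(Xs, 0, Sum).

sum_acc([], Sum, Sum).
sum_acc([X|Xs], Sum0, Sum) :-
    Sum1 is Sum0 + X,
    sum_acc(Xs, Sum1, Sum).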
Really nice question.
While waiting for someone to post a time/space analysis, the only caveat I can offer is that methods 1 and 2 don't terminate when the first argument is free, while method 3 does.
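For instance, here is a hypothetical session illustrating the termination caveat, using the permute2/2 name that also appears in the timings below:
?- permute2(Xs, [a,b]).
Xs = [a, b] ;
Xs = [b, a] ;
% no further answers exist, but select/3 keeps enumerating ever longer
% candidate lists for Xs, so the query does not terminate.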
Anyway, method 1 seems to be much more efficient than the builtin. Good to know.
edit: and given that the library implementation merely adjusts the instantiation of the arguments and then calls method 1, I'm going to propose your method 2 as an alternative on the SWI-Prolog mailing list (or, if you prefer to do it yourself, let me know).
more edit: I previously forgot to point out that permutation/3 (let's say, method 2) gives lexicographically ordered solutions, while method 1 doesn't. I think that could be a strong preferential requirement, but it should be expressed as an option, given the performance gain that method 1 allows.
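A quick way to see the ordering difference (a small check using the permute1/2 and permute2/2 definitions shown further down, with the standard select/3):
?- findall(P, permute2([1,2,3], P), Ps).
Ps = [[1,2,3], [1,3,2], [2,1,3], [2,3,1], [3,1,2], [3,2,1]].   % lexicographic
?- findall(P, permute1([1,2,3], P), Ps).
Ps = [[1,2,3], [2,1,3], [2,3,1], [1,3,2], [3,1,2], [3,2,1]].   % not lexicographic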
?- time(call_nth(permute1([0,1,2,3,4,5,6,7,8,9],P),1000000)).
% 3,112,758 inferences, 3.160 CPU in 3.162 seconds (100% CPU, 984974 Lips)
P = [1, 4, 8, 3, 7, 6, 5, 9, 2|...] .
?- time(call_nth(permute2([0,1,2,3,4,5,6,7,8,9],P),1000000)).
% 10,154,843 inferences, 9.779 CPU in 9.806 seconds (100% CPU, 1038398 Lips)
P = [2, 7, 8, 3, 9, 1, 5, 4, 6|...] .
YAP shows an even bigger gain!
?- time(call_nth(permute1([0,1,2,3,4,5,6,7,8,9],P),1000000)).
% 0.716 CPU in 0.719 seconds ( 99% CPU)
P = [1,4,8,3,7,6,5,9,2,0]
?- time(call_nth(permute2([0,1,2,3,4,5,6,7,8,9],P),1000000)).
% 8.357 CPU in 8.368 seconds ( 99% CPU)
P = [2,7,8,3,9,1,5,4,6,0]
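For completeness, call_nth/2 is not a standard built-in; here is only a minimal sketch of the meaning assumed above (succeed with the N-th solution of Goal), using SWI-Prolog's non-backtrackable global variables and assuming N is bound:
call_nth(Goal, N) :-
    nb_setval(call_nth_counter, 0),   % reset a global solution counter
    call(Goal),
    nb_getval(call_nth_counter, C0),
    C is C0 + 1,
    nb_setval(call_nth_counter, C),
    C =:= N,                          % keep only the N-th solution
    !.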
edit: I've posted a comment about this topic on the SWI-Prolog documentation page.
Really nice question, +1!
Tail call optimization (and, as a special case, tail recursion optimization) only applies if the predicate is deterministic! This is not the case here, so your predicate will always require local stack space, no matter in which order you place the goals. The non-tail-recursive version is more (time-)efficient here when generating all solutions because it needs to do fewer unifications on backtracking.
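For contrast, here is a minimal sketch (my example, not part of this answer) of a case where the optimization does apply, because the recursion is deterministic:
% countdown/1 is tail recursive and, for integer arguments, first-argument
% indexing can select a single clause, so no choice point is left at the
% recursive call and the loop can run in constant local stack space.
countdown(0).
countdown(N) :-
    N > 0,
    N0 is N - 1,
    countdown(N0).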
EDIT: I am expanding on this point since it is well worth studying the performance difference in more detail.
First, for clarity, I rename the two different versions to make clear which version I am talking about:
Variant 1: Non-tail-recursive:
permute1([], []).
permute1([X|Rest], L) :-
    permute1(Rest, L1),
    select(X, L, L1).
Variant 2: Tail-recursive:
permute2([], []).
permute2(L, [P|P1]) :-
    select(P, L, L1),
    permute2(L1, P1).
Note again that, although the second version is clearly tail recursive, tail call (and hence also tail recursion) optimisation only helps if the predicate is deterministic, and so it cannot help when we generate all permutations, because choice points are still left in that case.
Note also that I am deliberately retaining the original variable naming and main predicate name to avoid introducing more variants. Personally, I prefer a naming convention that makes clear which variables denote lists by appending an s to their names, in analogy to regular English plurals. Also, I prefer predicate names that more clearly exhibit the (at least intended and desirable) declarative, relational nature of the code, and recommend avoiding imperative names for this reason.
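Purely as an illustration of that convention (my rendering, not part of the code under discussion), variant 1 might be written as:
% The trailing "s" marks variables that denote lists, and the predicate name
% describes the relation between the arguments rather than an action:
list_permutation([], []).
list_permutation([X|Xs], Ps) :-
    list_permutation(Xs, Ps0),
    select(X, Ps, Ps0).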
Consider now unfolding the first variant and partially evaluating it for a list of 3 elements. We start with a simple goal:
?- Xs = [A,B,C], permute1(Xs, L).
and then gradually unfold it by plugging in the definition of permute1/2, while making all head unifications explicit. In the first iteration, we obtain:
?- Xs = [A,B,C], Xs1 = [B,C], permute1(Xs1, L1), select(A, L, L1).
The head unifications are the explicit (=)/2 goals such as Xs1 = [B,C].
Now, one goal of permute1/2 is still left. So we repeat the process, again plugging in the predicate's only applicable rule body in place of its head:
?- Xs = [A,B,C], Xs1 = [B,C], Xs2 = [C], permute1(Xs2, L2), select(B, L1, L2), select(A, L, L1).
One more pass of this, and we obtain:
?- Xs = [A,B,C], Xs1 = [B,C], Xs2 = [C], select(C, L2, []), select(B, L1, L2), select(A, L, L1).
This is what the original goal looks like if we just unfold the definition of permute1/2 repeatedly.
Now, what about the second variant? Again, we start with a simple goal:
?- Xs = [A,B,C], permute2(Xs, Ys).
One iteration of unfolding permute2/2 yields the equivalent version:
?- Xs = [A,B,C], Ys = [P|P1], select(P, Xs, L1), permute2(L1, P1).
and a second iteration yields:
?- Xs = [A,B,C], Ys = [P|P1], select(P, Xs, L1), P1 = [P2|P3], select(P2, L1, L2), permute2(L2, P3).
I leave the third and last iteration as a simple exercise that I strongly recommend you do.
And from this it is clear what we initially probably hadn't expected: A big difference lies in the head unifications, which the first version performs deterministically right at the start, and the second version performs over and over on backtracking.
This famous example nicely shows that, somewhat contrary to common expectation, tail recursion can be quite slow if the code is not deterministic.
Nice example. But I would rather use the following version, which doesn't leave a choice point behind at the permute([], []) base case:
permute3([], []).
permute3([H|T], [P|P1]) :-
    select(P, [H|T], L1),
    permute3(L1, P1).
It's tail recursive and about 20% faster than permute2/2, but still not as fast as permute1/2.
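The choice point difference can also be seen directly at the top level; a hypothetical SWI-Prolog session:
?- permute2([], P).
P = [] ;        % a useless choice point for the second clause remains
false.

?- permute3([], P).
P = [].         % first-argument indexing makes this call deterministic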
?- time((permute2([1,2,3,4,5,6,7,8,9,0],_), fail; true)).
% 29,592,302 inferences, 1.653 CPU in 1.667 seconds (99% CPU, 17896885 Lips)
true.
?- time((permute3([1,2,3,4,5,6,7,8,9,0],_), fail; true)).
% 25,963,501 inferences, 1.470 CPU in 1.480 seconds (99% CPU, 17662390 Lips)
true.
But I am not sure whether the explanation by mat is correct. It could also be that permute1/2 performs LCO less often than permute3/2 does. Namely, of the n! results of the recursive sub-call in permute1/2, only the last redo doesn't leave a choice point. On the other hand, in permute3/2 every select/3 call has n results and doesn't leave a choice point on its last redo. I did a little test, writing a period for each LCO:
?- permute1([1,2,3],_), fail; nl.
...
?- permute3([1,2,3],_), fail; nl.
..........
LCO doesn't have an extreme benefit in a failure-driven loop, but the Prolog system doesn't know that. So I guess that's where the unnecessary time is spent, to a larger extent in permute3/2.