Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily

筅森魡賤 提交于 2019-11-26 12:32:23
false

As a general remark you will find more on SO about it under the name library(pio). Also the way to use it cleanly is rather:

:- use_module(library(pio)).

Your example is way too complex, so I will only consider a slightly simpler case, a newline separated list of numbers:

nats([]) --> [].
nats([N|Ns]) --> nat(N), newline, nats(Ns).

So, how can you test this effectively? (That's your Question 3) The basic point of library(pio) is that you can use regular DCGs for file processing. But for testing in the small you can still use the simple phrase/2. So I do:

?- phrase(nats(Ns),"1\n").
Ns = [1] ;
false.

Did you see the ; prompt? That means: Prolog was not able to decide whether or not further answers might be computed - so it leaves one or more choice-points open. And that only for a single digit You can imagine how things will pile up.

Let's dig deeper:

?- phrase(digit(D),"1").
D = 1 ;
false.

Again the evil ;! In order to make this work, everything would have to be determinate. There are three ways to this:

Use cuts (and lose your soul)

I wish you luck - the best seems to be just after the repeating element:

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   trace_phrase(T),
   !, % ugly, but...
   trace_file_phrase(Ts).

(This should answer Question 1)

But, hold a minute! What is so bad about this !? As long, as there is exactly one answer to trace_phrase//1 things are perfect. It is only, if there are more answers (or actually solutions), that the cut might remove precious answers. How do you know, if there are more solutions? Well, you don't. And you won't see them as they have been cut away already.

call_semidet/1

Here is a way to ensure that this does not happen. This works only for side-effect free goals that can be called twice without any effect:

call_semidet(Goal) :-
   (  call_nth(Goal, 2)
   -> throw(error(mode_error(semidet,Goal),_))
   ;  once(Goal)
   ).

This uses call_nth/2, as defined in another post. (As an optimization, the implementation could avoid calling Goal twice when there is no choice-point open...) Just to make clear, how it works:

?- phrase(nat(N),"1234").
N = 1234 ;
false.

?- call_semidet(phrase(nat(N),"1234")).
N = 1234.

?- call_semidet((X=1;X=2)).
ERROR: Unknown error term: mode_error(semidet, (2=1;2=2))

So it makes your little grammar effectively determinate! There is thus no need to reformulate anything!

What is lacking now is some integration of this into the grammar. You can do this very low-level, or rather cleanly using library(lambda).

phrase_semidet(NT) -->
   call(S0^S^call_semidet(phrase(NT,S0,S))).

Note that in this very particular case we do not use the \ for renaming.

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   phrase_semidet(trace_phrase(T)),
   trace_file_phrase(Ts).

Exploit indexing

Finally, a very laborious but clean way would be to rewrite everything to profit better from indexing (and maybe help to improve indexing in general...) But this is a long road. Just to start with:

digit(D) --> [C],
   {c_digit(C,D)}.

c_digit(0'0,0).
c_digit(0'1,1).
c_digit(0'2,2).
c_digit(0'3,3).
c_digit(0'4,4).
c_digit(0'5,5).
c_digit(0'6,6).
c_digit(0'7,7).
c_digit(0'8,8).
c_digit(0'9,9).

This gives you now:

?- phrase(digit(D),"1").
D = 1.

But you have another source of nondeterminism which is rather due to the way you define the grammar. In nat//2 you see it:

nat(N,N) --> [].
nat(A,N) --> digit(D), ... .

The first rule always applies, that is, "1234\n" will be parsed as "1" "12" "123" "1234" only the following newline//0 realizes that it would suffice go for the last — and then stick to it.

You can rewrite things for that, but then, the code is no longer the pure little spec you liked, isn't it? Well, maybe things might improve in the future.

E.g. indexing is much better in SWI than it used to be, maybe also things evolve here too....

The intention of library(pio) was to get this process started. Compare this to Haskell — we are far away from interact efficiency-wise! But there is no inherent cost:

... --> [] | [_], ... .

?- phrase_from_file((...,"searchstring",...),fichier).

is just as efficient as grep — space-wise. That is, it runs in constant space. So hopefully more code will run better in the future.

Edit: BTW, library(pio) did already have an impact efficiency-wise: GC phases were improved significantly, very much in the same manner as Wadler's Fixing some space leak – paper a quarter century ago. Things evolve ...

I've verified the stackoverflow on a 2Mb file. Then I rewrote the grammar using library(dcg/basics), and now is working.

:- [library(dcg/basics)].

load_trace_0(Filename, Ls) :-
    phrase_from_file(lines(Ls), Filename).

lines([s(H,I)|R]) -->
    "s 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([l(H,I)|R]) -->
    "l 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([]) --> [].

But then I tried to put the cut on your grammar, and is working as well. So the answer from @gusbro (+1) solves your problem.

About the efficiency problem:

If your input is usually well formed, then I think you should swap the clauses of nat/4 and hexnum/4, so they would read:

nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).
nat(N,N) --> [].

hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).
hexnum(N, N) --> [].

because you only want to stop parsing a number when there are no more digits to consume.

If used wisely, the cut (!) can help you performance-wise and also regarding the stack overflow because it prunes the prolog evaluation tree. For example, you may commit (!) at the end of trace_file_phrase/3 (that is, after the newline) because you don't need to reparse that part of the input again to find other solutions.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!