What makes a DCG predicate expensive?

纵然是瞬间 提交于 2019-12-11 01:57:09

问题


I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.

I'm using SWI-Prolog, and that provides a "profile" goal, which provides some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These are hogging ~50-60% of total execution time.

At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:

scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].

...and so on.

Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:

scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

Thanks to that, I've gone from 70 rules to 31, and helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?

These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.

I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?

Thank you in advance!

EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.

sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].

回答1:


One aspect of your program that you might investigate is to make sure individual predicates succeed quickly and fail quickly. This is particularly useful to check for predicates that have many clauses.

For instance, when scalar(X) is evaluated on a token that is not a scalar then the program will have to try 31 (by your last count) times before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token then this could be very expensive.

Further, if scalar(X) does happen to find that a token matches but a subsequent goal fails then it appears that your program will retry the scalar(X) until all of the scalar//1 clauses have been attempted.

The judicious use of cut (!) or if-then-else (C1->G1;C2->G2;G3) can provide a tremendous performance improvement. Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:

scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.

scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).

This uses both cut and clause indexing (if the compiler provides it) with the scalar1/1 predicate.

EDIT: You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.




回答2:


Here's how I've tackled performance and optimization problems as a novice Prologer.

1.) Introduce timeouts to your application. I'm calling Prolog via the subprocess module in Python 3.6, and that allows you to set a timeout. As I've worked with my code base more, I've got a pretty good sense of how long a successful parse might take, and can assume anything taking longer is not going to work.

2.) Make use of the graphical profiler that's packaged in the swi-prolog IDE. This gives a lot more insight, as you can bounce around the call tree. I found it particularly helpful to sort predicates by the execution time of their children. Before I was thinking about it like pollution in a river. "Man, there's a lot of junk floating in here," I thought, not considering that upstream some factories were contributing a lot of that junk.

As for how to optimize a DCG without hurting the semantics & expressivity of one's grammar, I think that will have to be a question for another Stack Overflow. And as for my initial question, that's still an open one -- predicates that seem simple (to me) take quite a while.



来源:https://stackoverflow.com/questions/57130192/what-makes-a-dcg-predicate-expensive

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!