Erlang: What is most-wrong with this trie implementation?

青春壹個敷衍的年華 提交于 2019-12-07 06:17:27

问题


Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.

At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.

This is the trie:add implementation:

add([], Trie) ->
    dict:store(stop, true, Trie);

add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val }
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).

And you can see the rest, along with an example of how it's used (at the bottom), here:

http://gist.github.com/263513

Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.

So, what am I doing most-wrong? And how might I make it a bit less-wrong?


回答1:


You could implement this functionality by simply storing the words in an ets table:

% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).  
> ets:insert(words, [{"zebra"}]).    

% check if word exists
> ets:lookup(words, "zed").          
[{"zed"}]

% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]

If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.

If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).




回答2:


I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.

Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.




回答3:


An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)




回答4:


Look into DAWGs. They're much more compact than tries.




回答5:


I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts. http://www.erlang.org/doc/man/digraph.html




回答6:


If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.



来源:https://stackoverflow.com/questions/1964015/erlang-what-is-most-wrong-with-this-trie-implementation

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!