The brute force way can solve the problem in O(n!), basically calculating all the permutations and checking the results in a dictionary. I am looking for ways to improve the com
Here is an algorithm that finds all words that can be formed from a set of letters in O(1) with respect to the size of the dictionary. We will represent words by their spectra and store them in a prefix tree (aka trie).
The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of the letter A(i) in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.
For example, in the English alphabet, A(0) is A, A(1) is B, ..., A(25) is Z. The spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.
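To make the definition concrete, here is a minimal sketch of a spectrum computation that uses only the OCaml standard library. It hardcodes the 26-letter English alphabet, ignores case, and skips any character outside A-Z; those choices are assumptions of this sketch, not requirements of the definition.

```ocaml
(* Sketch: spectrum of a word over the English alphabet, standard library only. *)
let letters = 26

let spectrum word =
  let counts = Array.make letters 0 in
  String.iter
    (fun c ->
      let i = Char.code (Char.uppercase_ascii c) - Char.code 'A' in
      (* characters outside A-Z are ignored (an assumption of this sketch) *)
      if i >= 0 && i < letters then counts.(i) <- counts.(i) + 1)
    word;
  counts

let () =
  let s = spectrum "aha" in
  (* A is at index 0 and H is at index 7 *)
  Printf.printf "A=%d H=%d\n" s.(0) s.(7)
```

For the word aha this prints A=2 H=1, matching the vector shown above.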
We will store the dictionary in a prefix trie, using the spectrum as a key. The first token of a key is the frequency of the letter A, the second is the frequency of the letter B, and so on. (From here on we will use the English alphabet as an example.)
Once formed, our dictionary will be a tree with height 26 and a width that varies from level to level, depending on the popularity of the letter. Basically, each node at a given level has one subtree per frequency of that level's letter, so at most one more than the maximum frequency of this letter in any word of the provided dictionary (zero counts as a frequency too).
Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), we need to attach the words to their spectra, because the spectral transformation is not invertible (consider the spectra of the words read and dear, which are identical). We will attach each word to the end of the path that represents its spectrum.
To find whether we can build a word from a provided set, we build the spectrum of the set and find all paths in the prefix trie whose frequencies are bounded by the corresponding frequencies of the set's spectrum. (Note that we do not require all letters of the set to be used, so a word that uses fewer letters can still be built. Basically, our requirement is that for every letter in the word, its frequency must be less than or equal to the frequency of the same letter in the provided set.)
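The bounding condition can also be checked for a single word without any trie, which may help when reading the search code later on. The following is a minimal sketch using only the standard library; the function name fits and its labelled arguments are hypothetical and are not part of the reference implementation.

```ocaml
(* Sketch: per-letter bound check between a word and a set of letters. *)
let letters = 26

let spectrum word =
  let counts = Array.make letters 0 in
  String.iter
    (fun c ->
      let i = Char.code (Char.uppercase_ascii c) - Char.code 'A' in
      if i >= 0 && i < letters then counts.(i) <- counts.(i) + 1)
    word;
  counts

(* [fits ~word ~set] holds when every letter occurs in [word]
   at most as often as it occurs in [set]. *)
let fits ~word ~set =
  let w = spectrum word and s = spectrum set in
  let rec go i = i >= letters || (w.(i) <= s.(i) && go (i + 1)) in
  go 0

let () =
  Printf.printf "%b %b %b\n"
    (fits ~word:"dear" ~set:"read")  (* same spectrum *)
    (fits ~word:"red" ~set:"read")   (* uses fewer letters *)
    (fits ~word:"reed" ~set:"read")  (* needs two E, the set has one *)
```

Checking every dictionary word this way would be linear in the dictionary size; the trie exists to avoid exactly that scan.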
The complexity of the search procedure doesn't depend on the size of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. For the English alphabet, this is quite a small constant factor. For other alphabets, this might not be the case.
I will provide a reference implementation of the algorithm in OCaml.
The dictionary data type is recursive:
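type t = {
  dict : t Int.Map.t;  (* subtrees keyed by the frequency of the current letter *)
  data : string list;  (* words attached to the end of a path *)
}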
(Note: this is not the best representation; it is probably better to represent it as a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement with the above representation.)
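For comparison, the sum-type representation mentioned in the note could look roughly like the sketch below. It is not the reference implementation: it uses only the standard library (a Map over int keys named IntMap here, which is an assumption of the sketch) and returns plain lists instead of sequences. Insertion and search have the same shape as in the record-based version, except that words can only live in Data leaves at depth 26.

```ocaml
(* Sketch: the trie as a sum type, standard library only. *)
module IntMap = Map.Make (struct
  type t = int
  let compare = compare
end)

type t =
  | Dict of t IntMap.t   (* inner node: children keyed by letter frequency *)
  | Data of string list  (* leaf: words sharing one spectrum *)

let letters = 26

let spectrum word =
  let counts = Array.make letters 0 in
  String.iter
    (fun c ->
      let i = Char.code (Char.uppercase_ascii c) - Char.code 'A' in
      if i >= 0 && i < letters then counts.(i) <- counts.(i) + 1)
    word;
  counts

let empty = Dict IntMap.empty

let add_word dict word =
  let count = spectrum word in
  let rec add node i =
    match node with
    | Data words -> Data (word :: words)
    | Dict children ->
      (* a Dict node only occurs at depth i < letters, so count.(i) is valid *)
      let child =
        match IntMap.find_opt count.(i) children with
        | Some sub -> sub
        | None -> if i + 1 = letters then Data [] else Dict IntMap.empty
      in
      Dict (IntMap.add count.(i) (add child (i + 1)) children)
  in
  add dict 0

let build dict set =
  let count = spectrum set in
  let rec find node i =
    match node with
    | Data words -> words
    | Dict children ->
      (* try every frequency from 0 up to what the set offers *)
      List.init (count.(i) + 1) (fun cnt -> cnt)
      |> List.map (fun cnt ->
             match IntMap.find_opt cnt children with
             | None -> []
             | Some sub -> find sub (i + 1))
      |> List.concat
  in
  find dict 0

let () =
  let dict = List.fold_left add_word empty [ "read"; "dear"; "red"; "zebra" ] in
  build dict "read" |> List.sort compare |> List.iter print_endline
```

With this tiny four-word dictionary and the set read, the program prints dear, read and red, one per line. The sum type makes it impossible to attach words to an inner node, at the cost of one extra case when creating a fresh child.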
We can generalize the algorithm over the spectrum function, either by using a functor or by just storing the spectrum function in the dictionary, but for simplicity we will just hardcode the English alphabet in the ASCII representation:
let spectrum word =
  (* position of a character in the alphabet, case-insensitive *)
  let index c = Char.(to_int (uppercase c) - to_int 'A') in
  let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
  Array.init letters ~f:(fun i ->
      String.count word ~f:(fun c -> index c = i))
Next, we will define the add_word function of type dict -> string -> dict, which adds a new path to our dictionary by decomposing a word into its spectrum and adding each constituent. Each addition requires exactly 26 iterations, not including the spectrum computation. Note that the implementation is purely functional and doesn't use any imperative features: every call to add_word returns a new data structure.
let add_word dict word =
  let count = spectrum word in
  (* descend one level per letter, creating missing nodes on the way *)
  let rec add {dict; data} i =
    if i < Array.length count then {
      data;
      dict = Map.update dict count.(i) ~f:(function
          | None -> add empty (i+1)
          | Some sub -> add sub (i+1))
    } else {empty with data = word :: data} in
  add dict 0
We are using the following definition of the empty value in the add function:
let empty = {dict = Int.Map.empty; data=[]}
Now let's define the is_buildable function of type dict -> string -> bool, which decides whether the given set of characters can be used to build any word in the dictionary. Although we could express it via the search, by checking the size of the found set, we still prefer a specialized implementation, as it is more efficient and easier to understand. The definition of the function closely follows the general description provided above. Basically, for every letter of the alphabet, we check whether there is an entry in the dictionary with a frequency that is less than or equal to the frequency in the building set. If we manage to check all letters along some path, then we have proved that we can build at least one word with the given set.
let is_buildable dict set =
  let count = spectrum set in
  (* succeed as soon as one full path within the bounds is found *)
  let rec find {dict} i =
    i >= Array.length count ||
    Sequence.range 0 count.(i) ~stop:`inclusive |>
    Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
        | None -> false
        | Some dict -> find dict (i+1)) in
  find dict 0
Now let's actually find the set of all words that are buildable from the provided set:
let build dict set =
  let count = spectrum set in
  (* collect the words attached to every full path within the bounds *)
  let rec find {dict; data} i =
    if i < Array.length count then
      Sequence.range 0 count.(i) ~stop:`inclusive |>
      Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
          | None -> Sequence.empty
          | Some dict -> find dict (i+1))
    else Sequence.of_list data in
  find dict 0
We basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we collect all the proofs by reaching the end of each path and grabbing the set of words attached to it.
For the sake of completeness, we will test it with a small program that reads a dictionary, with each word on a separate line, and interacts with the user by asking for a set and printing the resulting set of words that can be built from it.
module Test = struct
  let run () =
    (* build the dictionary from the file named by the first argument *)
    let dict =
      In_channel.(with_file Sys.argv.(1)
                    ~f:(fold_lines ~init:empty ~f:add_word)) in
    let prompt () =
      printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
    prompt ();
    In_channel.iter_lines stdin ~f:(fun set ->
        build dict set |> Sequence.iter ~f:print_endline;
        prompt ())
end
Here is an example of an interaction that uses the /usr/share/dict/american-english dictionary available on my machine (Ubuntu Trusty).
./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
r
R
e
E
re
Re
Er
d
D
Rd
Dr
Ed
red
Red
a
A
Ra
Ar
era
ear
are
Rae
ad
read
dear
dare
Dare
Enter characters and hit enter (or Ctrl-D to stop):
(Yep, the dictionary contains words like r and d that are probably not true English words. In fact, the dictionary has a word for each letter, so we can basically build a word from any non-empty set of alphabet letters.)
The full implementation along with the building instructions can be found on Gist