问题
I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:
Node Class Module:
Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean
Main Module:
Dim tree As Node
Sub build_trie()
Set tree = New Node
Dim file, a, b, c As Integer
Dim current As Node
Dim wordlist As Collection
Set wordlist = New Collection
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
Do While Not EOF(file)
Dim line As String
Line Input #file, line
wordlist.add line
Loop
For a = 1 To wordlist.Count
Set current = tree
For b = 1 To Len(wordlist.Item(a))
Dim match As Boolean
match = False
Dim char As String
char = Mid(wordlist.Item(a), b, 1)
For c = 1 To current.next_nodes.Count
If char = current.next_nodes.Item(c).letter Then
Set current = current.next_nodes.Item(c)
match = True
Exit For
End If
Next c
If Not match Then
Dim new_node As Node
Set new_node = New Node
new_node.letter = char
current.next_nodes.add new_node
Set current = new_node
End If
Next b
current.is_word = True
Next a
End Sub
My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collection
s are not as efficient as Dictionary
s and so I attempted a Dictionary
-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.
Thanks in advance
NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)
回答1:
There are a number of small issues and a few larger opportunities here. You did say this is your first vba work, so forgive me if I'm telling you things you already know
Small things first:Dim file, a, b, c As Integer
declares file, a and b as variants. Integer
is 16 bit sign, so there may be risk of overflows, use Long
instead.
DIM
'ing inside loops is counter-productive: unlike C++ they are not loop scoped.
The real opportunity is:
Use For Each
where you can to iterate collections: its faster than indexing.
On my hardware your original code ran in about 160s. This code in about 2.5s (both plus time to load word file into the collection, about 4s)
Sub build_trie()
Dim t1 As Long
Dim wd As Variant
Dim nd As Node
Set tree = New Node
' Dim file, a, b, c As Integer : declares file, a, b as variant
Dim file As Integer, a As Long, b As Long, c As Long ' Integer is 16 bit signed
Dim current As Node
Dim wordlist As Collection
Set wordlist = New Collection
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
' no point in doing inside loop, they are not scoped to the loop
Dim line As String
Dim match As Boolean
Dim char As String
Dim new_node As Node
Do While Not EOF(file)
'Dim line As String
Line Input #file, line
wordlist.Add line
Loop
t1 = GetTickCount
For Each wd In wordlist ' for each is faster
'For a = 1 To wordlist.Count
Set current = tree
For b = 1 To Len(wd)
'Dim match As Boolean
match = False
'Dim char As String
char = Mid$(wd, b, 1)
For Each nd In current.next_nodes
'For c = 1 To current.next_nodes.Count
If char = nd.letter Then
'If char = current.next_nodes.Item(c).letter Then
Set current = nd
'Set current = current.next_nodes.Item(c)
match = True
Exit For
End If
Next nd
If Not match Then
'Dim new_node As Node
Set new_node = New Node
new_node.letter = char
current.next_nodes.Add new_node
Set current = new_node
End If
Next b
current.is_word = True
Next wd
Debug.Print "Time = " & GetTickCount - t1 & " ms"
End Sub
EDIT:
loading the word list into a dynamic array will reduce load time to sub second. Be aware that Redim Preserve is expensive, so do it in chunks
Dim i As Long, sz As Long
sz = 10000
Dim wordlist() As String
ReDim wordlist(0 To sz)
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
i = 0
Do While Not EOF(file)
'Dim line As String
Line Input #file, line
wordlist(i) = line
i = i + 1
If i > sz Then
sz = sz + 10000
ReDim Preserve wordlist(0 To sz)
End If
'wordlist.Add line
Loop
ReDim Preserve wordlist(0 To i - 1)
then loop through it like
For i = 0 To UBound(wordlist)
wd = wordlist(i)
回答2:
I'm out of practice with VBA, but IIRC, iterating the Collection using For Each should be a bit faster than going numerically:
Dim i As Variant
For Each i In current.next_nodes
If i.letter = char Then
Set current = i
match = True
Exit For
End If
Next node
You're also not using the full capabilities of Collection. It's a Key-Value map, not just a resizeable array. You might get better performance if you use the letter as a key, though looking up a key that isn't present throws an error, so you have to use an ugly error workaround to check for each node. The inside of the b loop would look like:
Dim char As String
char = Mid(wordlist.Item(a), b, 1)
Dim node As Node
On Error Resume Next
Set node = Nothing
Set node = current.next_nodes.Item(char)
On Error Goto 0
If node Is Nothing Then
Set node = New Node
current.next_nodes.add node, char
Endif
Set current = node
You won't need the letter variable on class Node that way.
I didn't test this. I hope it's all right...
Edit: Fixed the For Each loop.
Another thing you can do which will possibly be slower but will use less memory is use an array instead of a collection, and resize with each added element. Arrays can't be public on classes, so you have to add methods to the class to deal with it:
Public letter As String
Private next_nodes() As Node
Public is_word As Boolean
Public Sub addNode(new_node As Node)
Dim current_size As Integer
On Error Resume Next
current_size = UBound(next_nodes) 'ubound throws an error if the array is not yet allocated
On Error GoTo 0
ReDim next_nodes(0 To current_size) As Node
Set next_nodes(current_size) = new_node
End Sub
Public Function getNode(letter As String) As Node
Dim n As Variant
On Error Resume Next
For Each n In next_nodes
If n.letter = letter Then
Set getNode = n
Exit Function
End If
Next
End Function
Edit: And a final optimization strategy, get the Integer char value with the Asc function and store that instead of a String.
回答3:
You really need to profile it, but if you think Collections are slow maybe you can try using dynamic arrays?
来源:https://stackoverflow.com/questions/7692900/is-there-any-way-i-can-speed-up-this-vba-algorithm