问题
So here's the situation:
I have millions, possibly billions, of strings that I am trying to parse and put into a sorted structure, lets say I have 5,000,000 strings. I'm trying to write a fast program that can put all of these strings from an unsorted vector into an ordered data structure that can also search the structure fast, thus the reasoning for the AVL tree (which eventually I plan to use a hash table of a-z for even faster lookup, but that comes later). I get all of the strings into a vector first, but they are all jumbled up, unsorted and different lengths. I don't want any repeated strings in my tree, so if the program finds the strings "hello" and "hello" it would only have one AVL entry and would increment an integer holder for the frequency that this string has appeared.
So my question is this: would it be faster to sort the vector first (using something fast like a multi-threaded quicksort or something else) and then input it into the AVL tree, after all the words are sorted together with other same words, OR is it faster to just put all the data from the unsorted vector into the AVL tree, and continously checking the AVL tree for whether or not a word already exists, then incrementing it.
So to picture it in an order of operations here are the two cases:
CASE A:
> Get input/parse strings
> Put strings into vector (unsorted)
> Put vector into array or linked-list
> Quicksort that array/llist
> Input that sorted array into the AVL Tree
CASE B:
> Get input/parse strings
> Put strings into vector (unsorted)
> Insert vector data into AVL tree
> During insertion, check if there are duplicate words, if so, increment the counter
Which case is faster??
--EDIT-- So after hearing some of the comments, inserting a sorted array into an AVL tree from the beginning would be a bad idea, which makes sense due to how many rotations would be made. It seems that directly inserting into the AVL tree is probably a good idea, but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?
回答1:
Think of the way balancing works for AVL trees. It works best if the "middle values" come first. With a sorted input, you will need a lot of re-balancing, thus pre-sorting will probably do more harm than good.
For example, consider the following AVL tree holding the values 1-6:
4
/ \
2 5
/ \ \
1 3 6
If the input order is 4, 2, 5, 1, 3, 6
, you'll never need to balance the tree. In contrast, for a sorted input 1, 2, 3, 4, 5, 6
, you'll need many re-balancing operations:
1 +3 2 +4 2 +5 2 +6 3
\ ---> / \ ---> / \ ---> / \ ---> / \
2 1 3 1 3 1 4 2 5
\ / \ / / \
4 3 5 1 4 6
UPDATE The original question was whether sorting data before inserting into an AVL tree would improve performance. Now the OP edited the question, shifting towards his specific problem.
but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?
The whole point of an AVL tree is to efficiently find data, so I don't understand the question. It should be obvious how to traverse the binary search tree to find a value. Why would you want to sort data for that?
Please note that binary search trees are a good data structure to store keys, but it can also manage arbitrary data associated with these keys. In your case, you want to store a count along with your keys. Therefore, you don't need a tree of words/strings, but a tree of pairs (string, integer) that represent the word and its count. For the tree order, just consider the string key, i.e. the word.
For each word to insert, look it up in the tree. If it already exists, update the word count. Otherwise, insert a new pair with a word count of one.
A final note: The C++ standard library comes with a map
type that is usually (always?) implemented using a balancing tree (AVL or red-black). You'll save yourself a lot of work and bug-fixing by just using this implementation. Since C++11 there is also un unordered_map
, usually (always?) implemented using a hash table.
回答2:
I'll convert my comment into an answer.
If the set of strings is pre-defined, that is you are not going to add more strings to it after the initial load, then the fastest is probably not to use AVL tree (or any other tree) at all.
Just load the strings into std::vector
, sort it (O(N*logN), remove uniques (std::uniq
, O(N)) and then for lookup use std::lower_bound
(O(logN)).
Having the same complexity as AVL tree, most likely in practice it will be faster, because of increased cache-friendlyness.
回答3:
This following might not be any faster in the real world.
When inserting your sorted vector into the AVL tree, insert it like it is a tree itself. First insert the middle then recursively the middle of the left part and middle of the right part, and so on. If all the values in the vector are spread uniformly then you won't have to rebalance your tree.( in theory. )
Better still, you can construct your own tree out of the sorted vector( if you control the internal memory ) or just use binary search on it in the first place.
The only way to get a objective answer is to test and measure.
回答4:
1-Insertion in AVL tree is O(Log n). Sorting is O(nLogN), So sorting before insertions will decrease performance. 2-For counting purpose, you can use a hash table to find the number of occurrences of every word. Loop through all the words, update the count for every word in the hash table, then insert the words in the AVL tree using the hash table to check if the word was inserted or not and if not to insert it with its associated count.
回答5:
"but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?"
why wouldn't you use a map when: key = word, value = word index
that way you will gain access whenever the word exists and you will have the index to manipulate it
来源:https://stackoverflow.com/questions/27100192/which-is-faster-sorting-a-vector-then-putting-it-into-an-avl-tree-or-inputtin