问题
I would like to have a little clarification on the definiton of a bucket in SAS hashtable. The question is exactly about the hashexp parameter.
According to the SAS DOCs, hashexp is:
The hash object's internal table size, where the size of the hash table is 2n.
The value of HASHEXP is used as a power-of-two exponent to create the hash table size. For example, a value of 4 for HASHEXP equates to a hash table size of 24, or 16. The maximum value for HASHEXP is 20.
The hash table size is not equal to the number of items that can be stored. Imagine the hash table as an array of 'buckets.' A hash table size of 16 would have 16 'buckets.' Each bucket can hold an infinite number of items. The efficiency of the hash table lies in the ability of the hashing function to map items to and retrieve items from the buckets.
You should set the hash table size relative to the amount of data in the hash object in order to maximize the efficiency of the hash object lookup routines. Try different HASHEXP values until you get the best result. For example, if the hash object contains one million items, a hash table size of 16 (HASHEXP = 4) would work, but not very efficiently. A hash table size of 512 or 1024 (HASHEXP = 9 or 10) would result in the best performance.
The question is what exactly is a hash table size, while it is not a amount of data in the hash object?
Should it be understood as if we wanted to allocate as much memory as it may be neccessary but not less, no more. It is a power of two to get things work fast. But it does not limit the amount of data possibly used, it only indicates about how much is going to be used, right?
回答1:
Paul Dorfman (the master of hashing) goes into a fair bit of detail on page 10 of this whitepaper:
http://www2.sas.com/proceedings/forum2008/037-2008.pdf
As I understand it, hashtables store their data in binary trees. Each bucket created by hashexp represents the number of binary trees that will be used to store the data. A hashexp of 0 would use a single tree, while a hashexp of 8 would use 256 trees. When a lookup is performed against the hash object, an internal algorithm determines which tree the key should exist in (based on the hashed value). It then checks that tree for the value. By automatically knowing which of the 256 trees to look in (for example) it would have saved itself 8 comparisons (2^8) when compared to a single binary tree.
The whole thing seems a lot more complex than that but that's my interpretation of why it works out faster.
回答2:
As Rob Penridge pointed out, Paul Dorfman is indeed the SAS Hash Object Guru. Hashexp is not related to the size of the hash table, again as mentioned in Rob's answer.
If you have a table with 100obs and 10 numeric variables which is loaded into a hash table, then size of the hash table is simply 100obs*10vars*8bytes(assuming all numeric vars are stored as 8byte fields) 7.8KB give or take a 10%.
Remember that SAS dynamically allocates RAM space as records are added to the Hash table in memory, so you do not need to specify in advance what size it should be.[I've been using hash tables regularly, but cant think of any place where one can specify the size in advance].
General tip: if you want to know how big your hash table is going to be, run a PROC CONTENTS on the dataset you want to load into Hash table and multiply "Observation Length" & "No. of obs in dataset", this will give the memory size needed in bytes. If you have that much memory then you can load it into memory.
来源:https://stackoverflow.com/questions/11359103/what-exactly-is-table-size-in-sas-hashtable-specified-by-hashexp