Question
I am using Hive 0.13.1 and hashing a combination of keys with the default Hive hash function.
Something like: select hash(date, token1, token2, parameters["a"], parameters["b"], parameters["c"]) from table1;
I ran it on 150M rows. For 60% of the rows, it produced a proper hash. For the remaining rows, it returned 0, null, or 1 as the hash. I looked at the rows that produced bad hashes and don't see anything wrong with them. What could be causing this?
Answer 1:
The hash function returns 0 when every supplied argument is empty or null.
If you are familiar with Java, you can check the implementation of the hash function.
Internally, it uses ObjectInspectorUtils.hashCode to compute the hash code of each supplied field. You can use the Java snippet below to test this manually:
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

public class TestHash
{
    public static void main( String[] args )
    {
        // A null field: hashCode returns 0 regardless of the inspector.
        System.out.println( ObjectInspectorUtils.hashCode(null, PrimitiveObjectInspectorFactory.javaStringObjectInspector) );
        // An empty string: a Text value must be paired with the writable string inspector,
        // not the Java String inspector.
        System.out.println( ObjectInspectorUtils.hashCode(new Text(""), PrimitiveObjectInspectorFactory.writableStringObjectInspector) );
    }
}
Maven dependencies required to run the above program:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.2</version>
</dependency>
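To see why a row whose arguments are all null or empty collapses to 0, here is a minimal dependency-free sketch. It assumes the per-field hash codes are folded together with the usual multiplier of 31, and it uses String.hashCode() as a stand-in for Hive's per-type hashing. The class HashSketch and its methods fieldHash and combine are hypothetical names for illustration, not part of Hive.

```java
import java.util.Arrays;
import java.util.List;

public class HashSketch {
    // Stand-in for a per-field hash: null maps to 0, otherwise String.hashCode().
    // Hive's real per-type logic lives in ObjectInspectorUtils.hashCode; this only approximates it.
    public static int fieldHash(String s) {
        return s == null ? 0 : s.hashCode();
    }

    // Fold the field hashes left to right with multiplier 31 (assumed combination rule).
    public static int combine(List<String> fields) {
        int r = 0;
        for (String f : fields) {
            r = 31 * r + fieldHash(f);
        }
        return r;
    }

    public static void main(String[] args) {
        // Every field is null or empty, so each contributes 0 and the result stays 0.
        System.out.println(combine(Arrays.asList(null, "", null)));
        // A single non-empty field is enough to move the result away from 0.
        System.out.println(combine(Arrays.asList(null, "a", null)));
    }
}
```

If the query returns 0 for a row, a reasonable first check under this model is whether every hashed column, including the map lookups such as parameters["a"], is null or empty for that row.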
Source: https://stackoverflow.com/questions/38617437/hive-hash-function-resulting-in-0-null-and-1-why