How to handle potential data loss when performing comparisons across data types in different groups

Question

Background: Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.

https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data

This issue affects more than UNION ALL queries: we also have a function that performs comparisons on columns of different data types (e.g., STRING to BIGINT).

The group has decided that we do not want to change the underlying table metadata. So the solution is to accept potential data loss and convert the data explicitly with the CAST() function. In the case of UNION ALL, we cast to the destination table's column types. For comparisons, though, I am trying to determine the simplest way to compare values without getting erroneous results.
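For illustration, a minimal sketch of the UNION ALL approach described above. The table and column names (`target`, `staging_a`, `staging_b`, `id`, `amount`) are hypothetical, and the sketch assumes `target.amount` and `staging_b.amount` are BIGINT while `staging_a.amount` is STRING:

```sql
-- Hypothetical tables: every branch of the UNION ALL is cast explicitly
-- to the destination column type instead of relying on implicit conversion.
INSERT INTO TABLE target
SELECT id, CAST(amount AS BIGINT) AS amount   -- STRING -> BIGINT; non-numeric text becomes NULL
FROM staging_a
UNION ALL
SELECT id, amount                             -- already BIGINT, no cast needed
FROM staging_b;
```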

Question:

Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?

Update: If there are problems with this approach, is there a correct solution to handle this?

Note: this is my first engagement working with Hadoop/Hive, and I have learned that what I know from RDBMS land does not always apply.

Answer 1:

Yes, casting everything to a string can produce incorrect results. For instance, when a string is compared to an int:

'1.00' = 1 --> true, because the values are compared as numbers

But as strings:

'1.00' = '1' --> false, because the values are compared as strings

You can get similar issues with dates, I think.
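A minimal sketch of how the two choices might look in a query, assuming hypothetical tables `table_a` (where `account_id` is STRING) and `table_b` (where `account_id` is BIGINT). None of these names come from the question, and which option is right depends on how the string column is actually formatted:

```sql
-- Option 1: compare as strings. Only textually identical values match,
-- so '1.00', '01' or ' 1' in table_a will not match 1 in table_b.
SELECT a.account_id, b.account_id
FROM table_a a
JOIN table_b b
  ON a.account_id = CAST(b.account_id AS STRING);

-- Option 2: compare as numbers. Formatting differences no longer matter,
-- but strings that are not numeric cast to NULL and therefore never match.
SELECT a.account_id, b.account_id
FROM table_a a
JOIN table_b b
  ON CAST(a.account_id AS BIGINT) = b.account_id;

-- Diagnostic: list string values that do not survive a round trip through
-- BIGINT. These are the rows where the two options above could disagree.
SELECT account_id
FROM table_a
WHERE account_id IS NOT NULL
  AND (CAST(account_id AS BIGINT) IS NULL
       OR CAST(CAST(account_id AS BIGINT) AS STRING) <> account_id);
```

If the diagnostic query returns no rows, casting to STRING and casting to BIGINT should give the same matches for that column. Otherwise the returned values show which formatting cases need a decision.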

Source: https://stackoverflow.com/questions/58227352/how-to-handle-potential-data-loss-when-performing-comparisons-across-data-types

Tags

sql

Hadoop

Hive

Cloudera

hive-metastore