How to handle potential data loss when performing comparisons across data types in different groups

Submitted by ぃ、小莉子 on 2019-12-11 19:35:53

Question


Background: Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.

https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data

Not only does this issue affect UNION ALL queries, but there is also a function that compares columns of different data types (e.g., STRING to BIGINT).

The group has decided that we do not want to change the underlying table metadata. So the solution is to accept potential data loss by using the CAST() function to cast the data explicitly. In the case of UNION ALL, we cast to the destination table's column types. For comparisons, however, I am trying to determine the simplest way to compare values without getting erroneous results.

Question:

Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?

Update: If there are problems with this approach, is there a correct solution to handle this?

Note: this is my first engagement working with Hadoop/Hive, and I have learned that not everything I know from RDBMS land applies here.


Answer 1:


Yes, you may run into problems. For instance, when comparing a string to an int:

  • '1.00' = 1 --> true, because the values are compared as numbers

But as strings:

  • '1.00' = '1' --> false, because the values are compared as strings

You can get similar issues with dates, I think.
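
To make the difference concrete, here is a minimal sketch in Python rather than HiveQL. It only illustrates the general semantics of string versus numeric comparison; it does not reproduce Hive's exact conversion rules. The helper names (to_decimal, equal_as_strings, equal_as_numbers) are hypothetical, and Decimal is used as an assumed stand-in for a common numeric type.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(value) -> Optional[Decimal]:
    """Parse a value as a decimal number; return None if it is not numeric."""
    try:
        parsed = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    # Reject NaN and Infinity so that only ordinary numbers are compared.
    return parsed if parsed.is_finite() else None


def equal_as_strings(a, b) -> bool:
    """Compare the textual forms, as casting both sides to STRING would."""
    return str(a) == str(b)


def equal_as_numbers(a, b) -> Optional[bool]:
    """Compare numerically; return None when either side is not a number."""
    da, db = to_decimal(a), to_decimal(b)
    if da is None or db is None:
        return None
    return da == db


if __name__ == "__main__":
    # Pairs that are numerically equal but differ in their textual form.
    for left, right in [("1.00", 1), ("007", 7), ("abc", 1)]:
        print(left, right, equal_as_strings(left, right), equal_as_numbers(left, right))
```

The point of the sketch is that casting both sides to a string compares formatting, not value: trailing zeros and leading zeros make numerically equal values look different. Casting both sides to a sufficiently wide numeric type avoids that, at the cost of having to decide what to do with values that are not numeric at all (returned as None here).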



Source: https://stackoverflow.com/questions/58227352/how-to-handle-potential-data-loss-when-performing-comparisons-across-data-types
