I\'m trying to filter data between September 1st, 2010 and August 31st, 2013 in a Hive table. The column containing the date is in string format (yyyy-mm-dd). I can use month()
Hive has a lot of good date parsing UDFs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
Just doing the string comparison as Nigel Tufnel suggests is probably the easiest solution, although technically it's unsafe. But you probably don't need to worry about that unless your tables have historical data about the medieval ages (dates with only 3 year digits) or dates from scifi novels (dates with more than 4 year digits).
Anyway, if you ever find yourself in a situation where you would want to do fancier date comparisons, or if your date format is not in a "biggest to smallest" order, e.g. the American convention of "mm/dd/yyyy", then you could use unix_timestamp
with two arguments:
select *
from your_table
where unix_timestamp(your_date_column, 'yyyy-MM-dd') >= unix_timestamp('2010-09-01', 'yyyy-MM-dd')
and unix_timestamp(your_date_column, 'yyyy-MM-dd') <= unix_timestamp('2013-08-31', 'yyyy-MM-dd')
The great thing about yyyy-mm-dd
date format is that there is no need to extract month()
and year()
, you can do comparisons directly on strings:
SELECT *
FROM your_table
WHERE your_date_column >= '2010-09-01' AND your_date_column <= '2013-08-31';
Just like SQL, Hive supports BETWEEN operator for more concise statement:
SELECT *
FROM your_table
WHERE your_date_column BETWEEN '2010-09-01' AND '2013-08-31';
You have to convert string formate to required date format as following and then you can get your required result.
hive> select * from salesdata01 where from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') >= from_unixtime(unix_timestamp('2010-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') and from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') <= from_unixtime(unix_timestamp('2011-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') limit 10;
OK
1 3 13-10-2010 Low 6.0 261.54 0.04 Regular Air -213.25 38.94
80 483 10-07-2011 High 30.0 4965.7593 0.08 Regular Air 1198.97 195.99
97 613 17-06-2011 High 12.0 93.54 0.03 Regular Air -54.04 7.3
98 613 17-06-2011 High 22.0 905.08 0.09 Regular Air 127.7 42.76
103 643 24-03-2011 High 21.0 2781.82 0.07 Express Air -695.26 138.14
127 807 23-11-2010 Medium 45.0 196.85 0.01 Regular Air -166.85 4.28
128 807 23-11-2010 Medium 32.0 124.56 0.04 Regular Air -14.33 3.95
160 995 30-05-2011 Medium 46.0 1815.49 0.03 Regular Air 782.91 39.89
229 1539 09-03-2011 Low 33.0 511.83 0.1 Regular Air -172.88 15.99
230 1539 09-03-2011 Low 38.0 184.99 0.05 Regular Air -144.55 4.89
Time taken: 0.166 seconds, Fetched: 10 row(s)
hive> select * from salesdata01 where from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') >= from_unixtime(unix_timestamp('2010-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') and from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') <= from_unixtime(unix_timestamp('2010-12-01', 'yyyy-MM-dd'),'yyyy-MM-dd') limit 10;
OK
1 3 13-10-2010 Low 6.0 261.54 0.04 Regular Air -213.25 38.94
127 807 23-11-2010 Medium 45.0 196.85 0.01 Regular Air -166.85 4.28
128 807 23-11-2010 Medium 32.0 124.56 0.04 Regular Air -14.33 3.95
256 1792 08-11-2010 Low 28.0 370.48 0.04 Regular Air -5.45 13.48
381 2631 23-09-2010 Low 27.0 1078.49 0.08 Regular Air 252.66 40.96
656 4612 19-09-2010 Medium 9.0 89.55 0.06 Regular Air -375.64 4.48
769 5506 07-11-2010 Critical 22.0 129.62 0.05 Regular Air 4.41 5.88
1457 10499 16-11-2010 Not Specified 29.0 6250.936 0.01 Delivery Truck 31.21 262.11
1654 11911 10-11-2010 Critical 25.0 397.84 0.0 Regular Air -14.75 15.22
2323 16741 30-09-2010 Medium 6.0 157.97 0.01 Regular Air -42.38 22.84
Time taken: 0.17 seconds, Fetched: 10 row(s)
No need to extract the month and year.Just need to use the unix_timestamp(date String,format String) function.
For Example:
select yourdate_column
from your_table
where unix_timestamp(yourdate_column, 'yyyy-MM-dd') >= unix_timestamp('2014-06-02', 'yyyy-MM-dd')
and unix_timestamp(yourdate_column, 'yyyy-MM-dd') <= unix_timestamp('2014-07-02','yyyy-MM-dd')
order by yourdate_column limit 10;
Try this:
select * from your_table
where date >= '2020-10-01'