[take 2] OK, so you can't properly "stream" data into Hive. But you can add a periodic compaction post-processing job...
- create the table with three partitions, e.g. (role='activeA'), (role='activeB'), (role='archive') (see the DDL sketch after this list)
- point whatever collects your incoming records at the "A" partition (role='activeA')
- at some point, switch collection over to the "B" partition (role='activeB')
- then dump every record that you have collected in the "A" partition into "archive", hoping that the Hive default config will do a good job of limiting fragmentation
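For reference, a minimal sketch of what that table setup could look like; the twitter_data name comes from your post, but the column list and TEXTFILE storage are just placeholder assumptions to adapt:

-- placeholder columns: adapt to whatever your collector actually writes
CREATE TABLE twitter_data (
  id BIGINT,
  created_at STRING,
  tweet STRING
)
PARTITIONED BY (role STRING)
STORED AS TEXTFILE;

-- pre-create the three role partitions used below
ALTER TABLE twitter_data ADD PARTITION (role='activeA');
ALTER TABLE twitter_data ADD PARTITION (role='activeB');
ALTER TABLE twitter_data ADD PARTITION (role='archive');

The compaction step itself is then just the INSERT/TRUNCATE pair below: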
INSERT INTO TABLE twitter_data PARTITION (role='archive')
SELECT ...
FROM twitter_data WHERE role='activeA'
;
TRUNCATE TABLE twitter_data PARTITION (role='activeA')
;
at some point, switch back to "A" etc.
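For completeness, the mirrored statements for the "B" cycle would presumably look like this (same elided column list as above):

INSERT INTO TABLE twitter_data PARTITION (role='archive')
SELECT ...
FROM twitter_data WHERE role='activeB'
;
TRUNCATE TABLE twitter_data PARTITION (role='activeB')
;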
One last word: if Hive still creates too many files on each compaction job, then try tweaking some parameters in your session, just before the INSERT, e.g.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
You can use these options together.
Hive was designed for massive batch processing, not for transactions. That's why you have at least one data file for each LOAD or INSERT-SELECT command. And that's also why you have no INSERT-VALUES command, hence the lame syntax displayed in your post as a necessary workaround.
Well... that was true until transaction support was introduced. In a nutshell, you need (a) Hive V0.14 or later, (b) an ORC table, and (c) transaction support enabled on that table (i.e. locks, periodic background compaction, etc.)
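As a rough sketch of what that setup involves (the table name, columns, and bucket count here are purely illustrative assumptions, and the background compactor has to be enabled on the metastore side, not just per session):

-- ORC table with ACID transaction support (early Hive versions also required bucketing)
CREATE TABLE twitter_data_acid (
  id BIGINT,
  tweet STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- client-side settings typically needed for ACID statements
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- with transactions enabled, a plain INSERT ... VALUES finally works
INSERT INTO TABLE twitter_data_acid VALUES (1, 'hello'), (2, 'world');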
The wiki about Streaming data ingest in Hive might be a good start.