How to load data to same Hive table if file has different number of columns

问题

I have a main table (Employee) which is having 10 columns and I can load data into it using load data inpath /file1.txt into table Employee

My question is how to handle the same table (Employee) if my file file2.txt has same columns but column 3 and columns 5 are missing. if I directly load data last columns will be NULL NULL. but instead it should load 3rd as NULL and 5th column as NULL.

Suppose I have a table Employee and I want to load the file1.txt and file2.txt to table.

file1.txt
==========
id name sal deptid state coutry  
1  aaa  1000 01   TS   india  
2  bbb  2000 02   AP   india  
3  ccc  3000 03   BGL   india  


file2.txt  

id  name   deptid country  
1  second   001   US  
2  third    002   ENG  
3  forth    003   AUS

In file2.txt we are missing 2 columns i.e. sal and state.

we need to use the same Employee table how to handle it ?

回答1:

I'm not aware of any way to create a table backed by data files with a non-homogenous structure. What you can do however, is to define separate tables for the different column configurations and then define a view that queries both.

I think it's easier if I provide an example. I will use two tables of people, both have a column for name, but one stores height as well, while the other stores weight instead:

> create table table1(name string, height int);
> insert into table1 values ('Alice', 178), ('Charlie', 185);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, NULL as weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | NULL   |
| Bob     | NULL   | 98     |
| Charlie | 185    | NULL   |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Or as a closer example to your problem, let's say that one table has name, height and weight, while the other only has name and weight, thereby height is "missing from the middle":

> create table table1(name string, height int, weight int);
> insert into table1 values ('Alice', 178, 55), ('Charlie', 185, 78);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | 55     |
| Bob     | NULL   | 98     |
| Charlie | 185    | 78     |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Be sure to use union all and not just union, because the latter tries to remove duplicate rows, which makes it very expensive.

回答2:

It seems like there is no way to directly load into specified columns.

As such, this is what you probably need to do:

Load data inpath to a (temporary?) table that matches the file
Insert into relevant columns of final table by selecting the contents of the previous table.

The situation is very similar to this question which covers the opposite scenario (you only want to load a few columns).

来源：https://stackoverflow.com/questions/39580335/how-to-load-data-to-same-hive-table-if-file-has-different-number-of-columns

标签

Hadoop

Hive

hiveql

hadoop2