How to create n number of external tables with a single hdfs path using Hive

深忆病人 2020-11-27 23:04

Is it possible to create n external tables pointing to a single HDFS path using Hive? If yes, what are the advantages and limitations?

1 Answer
  • 2020-11-27 23:58

    It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.

    Creating tables with exactly the same schema on top of the same data is not useful at all, but you can create tables with a different number of columns, or with differently parsed columns (using RegexSerDe, for example), so these tables can have different schemas over the same files. You can also have different permissions on these tables in Hive. A table can even be created on top of a sub-folder of another table's folder, in which case it will contain a subset of the data, although partitions in a single table are usually a better way to achieve that.
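    As a minimal sketch of this (the table names, the /data/clicks path, and the tab delimiter are made up for illustration, not taken from the question), two external tables with different column counts can point at the same directory; a RegexSerDe-based table over the same path would work the same way:

    -- both tables read the same files under /data/clicks,
    -- but expose different schemas and can carry different permissions
    create external table clicks_full (id int, ts string, url string, referrer string)
    row format delimited fields terminated by '\t'
    location '/data/clicks';

    create external table clicks_slim (id int, ts string)
    row format delimited fields terminated by '\t'
    location '/data/clicks';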

    The drawback is that it is confusing: the same data can be overwritten through more than one table, and it can be dropped accidentally by someone who thinks the data belongs to only one table and removes it because that table is no longer needed.

    And here are a few tests:

    Create table with INT column:

    create table T(id int);
    OK
    Time taken: 1.033 seconds
    

    Check location and other properties:

    hive> describe formatted T;
    OK
    # col_name              data_type               comment
    
    id                      int
    
    # Detailed Table Information
    Database:               my
    Owner:                  myuser
    CreateTime:             Fri Jan 04 04:45:03 PST 2019
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://myhdp/user/hive/warehouse/my.db/t
    Table Type:             MANAGED_TABLE
    Table Parameters:
            transient_lastDdlTime   1546605903
    
    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
            serialization.format    1
    Time taken: 0.134 seconds, Fetched: 26 row(s)
    

    Create second table on top of the same location but with STRING column:

    hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
    OK
    Time taken: 0.029 seconds
    

    Insert data:

    hive> insert into table T values(1);
    OK
    Time taken: 33.266 seconds
    

    Check data:

    hive> select * from T;
    OK
    1
    Time taken: 3.314 seconds, Fetched: 1 row(s)
    

    Insert into second table:

    hive> insert into table T2 values( 'A');
    OK
    Time taken: 23.959 seconds
    

    Check data:

    hive> select * from T2;
    OK
    1
    A
    Time taken: 0.073 seconds, Fetched: 2 row(s)
    

    Select from first table:

    hive> select * from T;
    OK
    1
    NULL
    Time taken: 0.079 seconds, Fetched: 2 row(s)
    

    The string was selected as NULL because this table's column is defined as INT.

    And now insert a STRING into the first table (the INT column):

    insert into table T values( 'A');
    OK
    Time taken: 84.336 seconds
    

    Surprise, it is not failing!

    What was inserted?

    hive> select * from T2;
    OK
    1
    A
    NULL
    Time taken: 0.067 seconds, Fetched: 3 row(s)
    

    NULL was inserted, because during the previous insert the string was converted to INT, and that conversion produced NULL.
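    This conversion can be reproduced on its own (a small sketch; recent Hive versions allow SELECT without a FROM clause):

    -- casting a non-numeric string to INT yields NULL rather than an error
    select cast('A' as int);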

    Now let's try to drop one table and select from another one:

    hive> drop table T;
    OK
    Time taken: 4.996 seconds
    hive> select * from T2;
    OK
    Time taken: 6.978 seconds
    

    It returned 0 rows because the first table was MANAGED, and dropping it also removed the common location.
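    One way to double-check this (a sketch, assuming you are still in the Hive CLI, where dfs commands are available) is to list the shared location; after the drop it should report that the path no longer exists:

    hive> dfs -ls hdfs://myhdp/user/hive/warehouse/my.db/t;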

    THE END,

    the data is removed. Do we still need the T2 table without any data in it?

    drop table T2;
    OK
    

    The second table is removed; as you can see, at this point it was metadata only. This table was also managed, so DROP TABLE should remove the location with its data as well, but there was already nothing left to remove in HDFS, so only the metadata was deleted.
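    Since the question asks about external tables specifically: a sketch of the same setup using EXTERNAL tables (the names T3 and T4 are illustrative) avoids this pitfall, because dropping an external table removes only its metadata and leaves the files in HDFS:

    create external table T3(id int)
    location 'hdfs://myhdp/user/hive/warehouse/my.db/t';

    create external table T4(id string)
    location 'hdfs://myhdp/user/hive/warehouse/my.db/t';

    -- removes only T3's metadata; the files and T4 remain intact
    drop table T3;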
