Hadoop ORC file - How it works - How to fetch metadata

暖寄归人 2021-02-10 03:08

I am new to ORC files. I went through many blogs but didn't get a clear understanding. Please help clarify the questions below.

  1. Can I fetch schema from ORC file?

2 answers
  •  故里飘歌
    2021-02-10 03:50

    1. and 2. Use Hive and/or HCatalog to create, read, and update the ORC table structure in the Hive metastore. (HCatalog is just a side door that enables Pig, Sqoop, Spark, or whatever else to access the metastore directly.)

    2. The ALTER TABLE command lets you add or drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads after that (at least in V0.13 and V0.14).

    3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time to skip all stripes that clearly cannot meet the WHERE requirements, drastically reducing I/O in some cases. The same trick has become popular in column stores, e.g. InfoBright on MySQL, and also in Oracle Exadata appliances (dubbed "smart scan" by Oracle marketing).

    5. Hive works with "row store" formats (Text, SequenceFile, Avro) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts on the initial Map phase (e.g. stripe elimination, vectorized operators), and of course the serialization/deserialization phases are a bit more elaborate with column stores.
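
To make the min/max "index" and stripe elimination mentioned above more concrete, here is a minimal toy sketch in plain Python. It is not ORC's actual file layout or any Hive or ORC API: the `Stripe` class and the `write_stripes` and `scan_greater_than` functions are hypothetical names invented for illustration. It only shows the idea of recording min/max per stripe at write time and skipping stripes that cannot satisfy a `WHERE value > threshold` filter at read time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for a stripe: its rows plus min/max recorded at write time."""
    rows: List[int]
    min_value: int
    max_value: int


def write_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split values into fixed-size stripes and record min/max for each one."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return rows with value > threshold and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no row in this stripe can match, so skip it without reading rows
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


# Values 1..100 written in stripes of 10 rows, then filtered with value > 85.
stripes = write_stripes(list(range(1, 101)), stripe_size=10)
matches, stripes_read = scan_greater_than(stripes, threshold=85)
print(f"{len(matches)} matching rows, read {stripes_read} of {len(stripes)} stripes")
# prints "15 matching rows, read 2 of 10 stripes"
```

In this sketch the input is sorted, so each stripe covers a narrow value range and eight of the ten stripes are skipped. With randomly ordered data most stripes would span nearly the full range and little would be skipped, which is why the I/O reduction only shows up "in some cases".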
