Hadoop ORC file - How it works - How to fetch metadata

暖寄归人 2021-02-10 03:08

I am new to ORC files. I have read many blogs but still don't have a clear understanding. Please help clarify the questions below.

Can I fetch the schema from an ORC file?

2 answers

故里飘歌 (original poster)

2021-02-10 03:50

1. and 2. Use Hive and/or HCatalog to create, read, and update the ORC table structure in the Hive metastore. (HCatalog is just a side door that enables Pig, Sqoop, Spark, and other tools to access the metastore directly.)

2. The ALTER TABLE command lets you add or drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads afterwards (at least in V0.13 and V0.14).

3. and 4. The term "index" is rather inappropriate. Basically it is just min/max information persisted in the stripe footer at write time, then used at read time to skip every stripe that clearly cannot satisfy the WHERE clause, which drastically reduces I/O in some cases. (The trick has become popular in column stores, e.g. InfoBright on MySQL, and also in Oracle Exadata appliances, where Oracle marketing dubbed it "smart scan".)

5. Hive works with "row store" formats (Text, SequenceFile, Avro) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts in the initial Map phase, e.g. stripe elimination and vectorized operators, and of course the serialization/deserialization phases are a bit more elaborate with column stores.
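
To make the min/max point from 3. and 4. concrete, here is a minimal Python sketch of stripe elimination. It is not the ORC reader's actual code: the StripeStats class, the stripes_to_read function, and the sample values are all hypothetical, and it assumes a single integer column filtered by a range predicate such as WHERE col BETWEEN lower AND upper.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical per-stripe statistics for one integer column."""
    stripe_id: int
    min_value: int
    max_value: int


def stripes_to_read(stats, lower, upper):
    """Return ids of stripes whose [min, max] range overlaps [lower, upper].

    A stripe is skipped only when its range proves that no row can match;
    a kept stripe may still contain zero matching rows.
    """
    return [
        s.stripe_id
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Made-up statistics for three stripes of the same column.
stripes = [
    StripeStats(stripe_id=0, min_value=1, max_value=100),
    StripeStats(stripe_id=1, min_value=101, max_value=200),
    StripeStats(stripe_id=2, min_value=201, max_value=300),
]

# Equivalent of: WHERE col BETWEEN 150 AND 180
print(stripes_to_read(stripes, 150, 180))
```

Only stripe 1 has to be read for this predicate. The trick pays off most when the data is sorted or clustered on the filtered column, because the min/max ranges then overlap very little.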
