How to force MR execution when running a simple Hive query?

Posted by 断了今生、忘了曾经 on 2021-01-28 13:31:18

Question


I have Hive 2.1.1 running on MR, a table test_table stored as sequencefile, and the following ad-hoc query:

select t.*
  from test_table t
 where t.test_column = 100

Although this query can be executed as a fetch task without starting MR, scanning the HDFS files that way sometimes takes longer than running a single map job.

When I want to force MR execution, I make the query more complex, e.g. by adding distinct. This approach has significant drawbacks:

  1. The results may differ from those of the original query
  2. It puts a pointless computation load on the cluster

Is there a recommended way to force MR execution when using Hive-on-MR?


Answer 1:


Hive decides whether to run a map task or a fetch task based on the following settings (defaults in parentheses):

  • hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
  • hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
  • hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
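
Before changing anything, it can help to check which values are actually in effect, since a cluster may override the defaults. A minimal sketch, assuming an interactive Hive CLI or Beeline session, where set with a key and no value prints the current setting:

```sql
-- Print the current session values of the three settings listed above.
set hive.fetch.task.conversion;
set hive.fetch.task.conversion.threshold;  -- value is in bytes
set hive.fetch.task.aggr;
```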

This suggests two options:

  1. set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
  2. set hive.fetch.task.conversion to "none"

For some reason lowering the threshold did not change anything in my case, so I went with the second option, which seems fine for ad-hoc queries.
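
As a rough sketch, the two options look like this at session level, using the table and query from the question. The threshold is given in bytes, and it only has an effect when the input is larger than the threshold, which may be one reason it appears to do nothing on a small table. Prefixing the query with explain is one way to check which plan Hive picked: with fetch conversion disabled, the plan should contain a Map Reduce stage and not just a Fetch Operator stage.

```sql
-- Option 1: lower the fetch threshold to 512 MB (512 * 1024 * 1024 bytes).
set hive.fetch.task.conversion.threshold=536870912;

-- Option 2: disable fetch-task conversion for this session.
set hive.fetch.task.conversion=none;

-- Inspect the plan before running the query.
explain
select t.*
  from test_table t
 where t.test_column = 100;

-- Run the query itself.
select t.*
  from test_table t
 where t.test_column = 100;

-- Optionally restore the default listed above once the ad-hoc work is done.
set hive.fetch.task.conversion=more;
```

Because set only affects the current session, other users and scheduled jobs keep their existing behaviour.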

More details on these settings can be found on the Cloudera forum and in the Hive wiki.




Answer 2:


Just add set hive.execution.engine=mr; before your query, and it will force Hive to use MR.
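
A short sketch of this suggestion, again using the query from the question. One caveat, stated as an assumption about how the settings interact: hive.execution.engine selects the engine for jobs that Hive does launch, while fetch-task conversion is controlled separately, so a simple select may still run as a fetch task. Checking the plan with explain shows which one applies.

```sql
-- Select MapReduce as the execution engine for this session.
set hive.execution.engine=mr;

-- Check whether the plan has a Map Reduce stage or only a fetch stage.
explain
select t.*
  from test_table t
 where t.test_column = 100;
```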



Source: https://stackoverflow.com/questions/62536791/how-to-force-mr-execution-when-running-simple-hive-query
