apache-tez

How to reduce the number of files generated by "ALTER TABLE/PARTITION CONCATENATE" in Hive?

天大地大妈咪最大 submitted on 2019-12-05 07:01:10
Hive version: 1.2.1

Configuration:

set hive.execution.engine=tez;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.tezfiles=true;

HQL:

ALTER TABLE `table_name` PARTITION (partion_name1 = 'val1', partion_name2 = 'val2', partion_name3 = 'val3', partion_name4 = 'val4') CONCATENATE;

I use this HQL to merge the files of a specific table/partition. However, after execution there are still many files in the output directory, and their sizes are far less than 256000000. So how can I decrease the number of output files? BTW, using MapReduce instead of Tez didn't work either.
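For reference, a hedged sketch of how these merge settings are usually combined. `hive.merge.size.per.task` does not appear in the question; it is the Hive setting that controls the target size of files produced by the merge step, so treat its inclusion here as an assumption rather than the accepted fix:

```sql
-- Sketch only, assuming Hive 1.2.1 on Tez.
-- hive.merge.size.per.task is an addition not present in the question;
-- it sets the target size of each file produced by the merge step.
SET hive.execution.engine=tez;
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=256000000;  -- merge when avg file size is below this
SET hive.merge.size.per.task=256000000;       -- target size of each merged file

ALTER TABLE `table_name` PARTITION (partion_name1 = 'val1') CONCATENATE;
```

Note that CONCATENATE only merges files for RCFile and ORC tables; for other formats, an INSERT OVERWRITE of the partition with the merge settings above is the usual workaround.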

Map-Reduce Logs on Hive-Tez

社会主义新天地 submitted on 2019-12-04 09:11:51
I want an interpretation of the Map-Reduce logs after running a query on Hive-Tez. What do the lines after "INFO :" convey? Here is a sample I have attached:

INFO : Session is already open
INFO : Dag name: SELECT a.Model...)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_14708112341234_1234)
INFO : Map 1: -/- Map 3: -/- Map 4: -/- Map 7: -/- Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: -/- Map 3: 0/118 Map 4: 0/118 Map 7: 0/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13

Is Hive faster than Spark?

喜你入骨 submitted on 2019-12-04 09:11:39
After reading "What is hive, Is it a database?", a colleague yesterday mentioned that he was able to filter a 15B-row table, join it with another table after doing a "group by" (which resulted in 6B records), in only 10 minutes! I wonder if this would be slower in Spark; now with DataFrames they may be comparable, but I am not sure, thus the question. Is Hive faster than Spark? Or does this question not have meaning? Sorry for my ignorance. He uses the latest Hive, which seems to be using Tez. Answer 1: Hive is just a framework that gives SQL functionality to MapReduce-type workloads.

How to reduce the number of containers used by a query

混江龙づ霸主 submitted on 2019-12-01 10:05:39
Question: I have a query using too many containers and too much memory (97% of the memory used). Is there a way to set the number of containers used by the query and limit the max memory? The query is running on Tez. Thanks in advance.

Answer 1: Controlling the number of mappers: the number of mappers depends on various factors such as how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also "How initial task parallelism works". MR uses CombineInputFormat, while Tez uses grouped splits.
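A hedged sketch of the knobs usually involved: on Tez, mapper-container count follows the grouped split sizes, reducer count follows the bytes-per-reducer target, and memory is capped per container. The setting names are standard Hive/Tez configuration, but none of the values below come from the original answer; they are illustrative assumptions:

```sql
-- Sketch only: illustrative values, not from the original answer.
-- Larger grouping sizes => fewer grouped splits => fewer mapper containers.
SET tez.grouping.min-size=268435456;    -- 256 MB lower bound per grouped split
SET tez.grouping.max-size=1073741824;   -- 1 GB upper bound per grouped split

-- Cap the memory requested per Tez container (in MB):
SET hive.tez.container.size=2048;

-- Reducer count is derived from this target; raising it lowers parallelism:
SET hive.exec.reducers.bytes.per.reducer=1073741824;
```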

Hive tez execution error

廉价感情. submitted on 2019-11-30 20:18:57
Question: I am running a Hive query and got the following error when setting hive.execution.engine=tez, while the query works under engine=MR:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

My query is an inner join and the data is quite big. Another thing is that I have hit this problem before, but Tez worked later, so I thought it was something unstable about Hive.

Answer 1: While running your HQL via Hive, include the following parameter. This will give

Why is the hive_staging file missing in AWS EMR?

自古美人都是妖i submitted on 2019-11-29 13:37:14
Problem: I am running a query in AWS EMR. It is failing with the exception:

java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.

I have included all the related information for this problem below. Please check.

Query:

INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
SELECT * FROM (
select commerce_feed_redshift_dedup.sku AS sku, commerce_feed_redshift_dedup.revenue AS revenue, commerce_feed_redshift_dedup.orders AS orders,
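The question is truncated before any answer appears, so the following is a hedged suggestion only. A commonly reported mitigation for missing `.hive-staging` directories on S3 is to keep the staging location off S3, since S3 listing consistency at the time could hide freshly written staging files from the final move. `hive.exec.stagingdir` is a real Hive setting, but its use here is an assumption, not the accepted fix for this question:

```sql
-- Assumption / sketch: keep intermediate staging data on HDFS instead of S3,
-- so the final INSERT OVERWRITE does not depend on S3 listing consistency.
SET hive.exec.stagingdir=/tmp/hive-staging;
```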

could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation

£可爱£侵袭症+ submitted on 2019-11-28 00:23:06
Question: I don't know how to fix this error:

Vertex failed, vertexName=initialmap, vertexId=vertex_1449805139484_0001_1_00, diagnostics=[Task failed, taskId=task_1449805139484_0001_1_00_000003, diagnostics=[AttemptID:attempt_1449805139484_0001_1_00_000003_0 Info:Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/gridmix-kon/input/_temporary/1/_temporary/attempt_14498051394840_0001_m_000003_0/part-m-00003/segment-121 could only be replicated to 0 nodes instead of