hadoop2

Querying Hbase efficiently

Submitted by 不羁的心 on 2019-12-12 10:58:04

Question: I'm using Java as a client for querying HBase. My HBase table is set up like this:

    ROWKEY     | HOST         | EVENT
    -----------|--------------|----------
    21_1465435 | host.hst.com | clicked
    22_1463456 | hlo.wrld.com | dragged
    .          | .            | .
    .          | .            | .
    .          | .            | .

The first thing I need to do is get a list of all ROWKEYs which have host.hst.com associated with them. I can create a scanner on the HOST column and, for each row whose column value equals host.hst.com, add the corresponding ROWKEY to the list. Seems pretty …
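A minimal sketch of that scan with the HBase Java client, pushing the comparison down to the region servers with a SingleColumnValueFilter. The table name events and the column family/qualifier d:host are placeholders, not taken from the question:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HostRowKeys {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {      // hypothetical table name

                // Request only the HOST column and let the region servers drop
                // every row whose value is not host.hst.com.
                Scan scan = new Scan();
                scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("host"));        // assumed family:qualifier
                scan.setFilter(new SingleColumnValueFilter(
                        Bytes.toBytes("d"), Bytes.toBytes("host"),
                        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("host.hst.com")));

                List<String> rowKeys = new ArrayList<>();
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        rowKeys.add(Bytes.toString(r.getRow()));
                    }
                }
                System.out.println(rowKeys.size() + " row keys for host.hst.com");
            }
        }
    }

With the filter applied server-side, only matching rows travel to the client, which is usually the main efficiency win over scanning everything and comparing in Java.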

How to change HDFS replication factor for HIVE alone

Submitted by 核能气质少年 on 2019-12-12 05:25:42

Question: Our current HDFS cluster has a replication factor of 1, but to improve performance and reliability (node failure) we want to raise the replication factor of the Hive intermediate files (hive.exec.scratchdir) alone to 5. Is it possible to implement that?

Regards,
Selva

Answer 1: See if -setrep helps you.

    setrep

    Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

    Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under …
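The same change can be made programmatically through the Hadoop FileSystem API. A minimal sketch, assuming the scratch directory is /tmp/hive (the real path is whatever hive.exec.scratchdir points to on your cluster); note it only affects files that already exist, so files Hive writes later still use the client-side dfs.replication:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class RaiseScratchDirReplication {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path scratch = new Path("/tmp/hive");   // assumed value of hive.exec.scratchdir
            short replication = 5;

            // Walk every file under the scratch dir and bump its replication factor,
            // the programmatic equivalent of: hadoop fs -setrep -R 5 /tmp/hive
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(scratch, true);
            while (files.hasNext()) {
                fs.setReplication(files.next().getPath(), replication);
            }
            fs.close();
        }
    }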

Impala GROUP BY partitioned column

Submitted by 拈花ヽ惹草 on 2019-12-12 04:14:16

Question: Theoretical question. Let's say I have a table with four columns: A, B, C, D. The values of A and D are equal, and the table is partitioned by column A. Performance-wise, would it make any difference whether I issue this query

    SELECT SUM(B) GROUP BY A;

or this one:

    SELECT SUM(B) GROUP BY D;

In other words, is there any performance gain from using GROUP BY on the partitioned column? Thanks

Answer 1: Usually there are performance gains if you use the partitioned columns in a filter (WHERE clause in your …

Hue 500 server error

Submitted by 限于喜欢 on 2019-12-12 04:11:51

Question: I am creating a simple MapReduce job. After submitting it, it gives the error below. Please suggest how to fix this issue.

Answer 1: I know I am too late to answer, but I have noticed that this usually gets solved if you clear your cookies.

Source: https://stackoverflow.com/questions/37207387/hue-500-server-error

Not able to recover partitions through alter table in Hive 1.2

Submitted by 无人久伴 on 2019-12-12 03:35:39

Question: I am not able to run

    ALTER TABLE MY_EXTERNAL_TABLE RECOVER PARTITIONS;

on Hive 1.2. However, when I run the alternative

    MSCK REPAIR TABLE MY_EXTERNAL_TABLE;

it just lists the partitions that aren't in the Hive metastore and does not add them. Based on the hive-exec source (org/apache/hadoop/hive/ql/parse/HiveParser.g:1001:1) I can see that there is no token in the grammar matching RECOVER PARTITIONS. Kindly let me know if there's a way to recover all the partitions after …
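RECOVER PARTITIONS is an extension found in Amazon EMR's build of Hive rather than in stock Hive, which would explain the missing grammar token. When MSCK REPAIR only lists the missing partitions, the usual fallback is to register them explicitly with ALTER TABLE ... ADD PARTITION. A sketch over Hive JDBC, with the HiveServer2 URL, credentials, and the dt partition column all assumed for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddMissingPartitions {
        public static void main(String[] args) throws Exception {
            // Assumed HiveServer2 endpoint and credentials -- adjust for your cluster.
            String url = "jdbc:hive2://localhost:10000/default";
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Register each partition MSCK reported as missing.
                // IF NOT EXISTS makes the statement safe to re-run.
                String[] missingDates = {"2016-01-01", "2016-01-02"};   // hypothetical partition values
                for (String dt : missingDates) {
                    stmt.execute("ALTER TABLE MY_EXTERNAL_TABLE ADD IF NOT EXISTS "
                            + "PARTITION (dt='" + dt + "')");
                }
            }
        }
    }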

Yarn application not getting killed even after Application Master is terminated

Submitted by 岁酱吖の on 2019-12-12 03:29:15

Question: My application is suffering because of this issue: even after killing the Application Master, the application is not actually getting killed. It's a known YARN issue, YARN-3561. It occurs out of the blue, so I have developed a fix in my application and I want to test it, but as of now the YARN issue is not replicating again. Is there any sure-shot way of replicating this issue so I can verify my fix?

Answer 1: I was able to replicate this by launching the application as a daemon process by using …

Error message while copy file from LocalFile to hdfs

Submitted by Deadly on 2019-12-12 03:15:20

Question: I tried to copy a file from local to HDFS using the command

    hadoop dfs -copyFromLocal in/ /user/hduser/hadoop

The following error message is shown. Please help to find the problem.

    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    15/02/02 19:22:23 WARN hdfs.DFSClient: DataStreamer Exception
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hduser/hadoop._COPYING_ could only be replicated to 0 nodes instead of …
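"Could only be replicated to 0 nodes" generally means the NameNode sees no live DataNodes (DataNode not started, out of disk space, or unreachable). Running hdfs dfsadmin -report shows the live node count; the same check can be done from Java, as in this sketch (it assumes fs.defaultFS in the classpath configuration points at the cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class LiveDataNodeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            if (!(fs instanceof DistributedFileSystem)) {
                System.err.println("fs.defaultFS does not point at an HDFS cluster: " + fs.getUri());
                return;
            }

            // Lists every DataNode the NameNode currently knows about, with remaining capacity.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            DatanodeInfo[] nodes = dfs.getDataNodeStats();
            System.out.println("Registered DataNodes: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                System.out.println(node.getHostName() + "  remaining=" + node.getRemaining() + " bytes");
            }
        }
    }

If the count is zero, the fix is to get at least one DataNode running and reporting before retrying the copy.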

Apache PIG, ELEPHANTBIRDJSON Loader

Submitted by 陌路散爱 on 2019-12-12 02:14:56

Question: I'm trying to parse the input below (there are 2 records in this input) using the Elephant Bird JSON loader:

    [{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":136.40000000000001,"node_disk_bytes_in_rate_22":187392.0,"node_disk_lnum_7":13}]
    [{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":105.2,"node_disk_bytes_in_rate_22":123084.8,"node_disk_lnum_7":13}]

Here is my syntax:

    register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
    a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig …

Must-have properties for core-site, hdfs-site, mapred-site and yarn-site.xml

Submitted by 半腔热情 on 2019-12-12 02:14:43

Question: Can anyone please let me know the must-have properties for core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml without which Hadoop cannot start?

Answer 1: The settings below are for Hadoop 2.x.x, for a standalone and pseudo-distributed single-node setup.

core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.name …
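The excerpt cuts off before the mapred-site.xml and yarn-site.xml parts of the answer. Independently of what that answer went on to say, a quick way to see which of these keys are actually in effect is to print them from the Configuration that Hadoop loads off the classpath; mapreduce.framework.name and yarn.nodemanager.aux-services are the properties the standard pseudo-distributed setup instructions put in those last two files:

    import org.apache.hadoop.conf.Configuration;

    public class EffectiveHadoopConfig {
        public static void main(String[] args) {
            // new Configuration() picks up core-default.xml and core-site.xml from the classpath;
            // add the other site files explicitly so their values are visible here too.
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");
            conf.addResource("mapred-site.xml");
            conf.addResource("yarn-site.xml");

            String[] keys = {
                "fs.defaultFS",                  // core-site.xml (fs.default.name is its deprecated alias)
                "dfs.replication",               // hdfs-site.xml
                "mapreduce.framework.name",      // mapred-site.xml, usually "yarn"
                "yarn.nodemanager.aux-services"  // yarn-site.xml, usually "mapreduce_shuffle"
            };
            for (String key : keys) {
                System.out.println(key + " = " + conf.get(key));
            }
        }
    }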

Reuse Hadoop code in Spark efficiently?

Submitted by 放肆的年华 on 2019-12-12 01:23:01

Question: Hi, I have code written in Hadoop and now I am trying to migrate it to Spark. The mappers and reducers are fairly complex, so I would like to reuse the Mapper and Reducer classes of the existing Hadoop code inside the Spark program. Can somebody tell me how I can achieve this?

EDIT: So far, I have been able to reuse the mapper class of the standard Hadoop word-count example in Spark, implemented as below.

wordcount.java

    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;
    import org …
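One common way to get this kind of reuse (a sketch, not the poster's actual code) is to pull the per-record logic out of the Hadoop Mapper into a plain static method and call it from both the old Mapper.map() and a Spark flatMapToPair(); driving the original Mapper class unchanged is awkward because its Context has no Spark equivalent. The Spark side then looks roughly like this:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WordCountReuse {

        // Shared per-record logic: the existing Hadoop Mapper.map() and the Spark
        // flatMapToPair() below can both delegate to this one method.
        public static List<Tuple2<String, Integer>> mapRecord(String line) {
            List<Tuple2<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new Tuple2<>(word, 1));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount-reuse");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);

            // flatMapToPair plays the role of the map phase, reduceByKey of the reduce phase.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMapToPair(line -> mapRecord(line).iterator())
                    .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile(args[1]);
            sc.stop();
        }
    }

The extracted method can then be unit-tested on its own, which is often the easiest way to confirm the Spark port behaves like the original Hadoop job.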