hadoop2

Querying Hbase efficiently

Submitted by 不羁的心 on 2019-12-12 10:58:04

Question: I'm using Java as a client for querying HBase. My HBase table is set up like this:

    ROWKEY     | HOST         | EVENT
    -----------|--------------|----------
    21_1465435 | host.hst.com | clicked
    22_1463456 | hlo.wrld.com | dragged
    .          | .            | .
    .          | .            | .
    .          | .            | .

The first thing I need to do is get a list of all ROWKEYs which have host.hst.com associated with them. I can create a scanner on the HOST column and, for each row whose column value equals host.hst.com, add the corresponding ROWKEY to the list. Seems pretty …
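A minimal sketch of that scan with the HBase Java client, pushing the comparison down to the region servers with a SingleColumnValueFilter. The table name events and the column family/qualifier d:host are placeholders, not taken from the question:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HostRowKeys {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {      // hypothetical table name

                // Request only the HOST column and let the region servers drop
                // every row whose value is not host.hst.com.
                Scan scan = new Scan();
                scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("host"));        // assumed family:qualifier
                scan.setFilter(new SingleColumnValueFilter(
                        Bytes.toBytes("d"), Bytes.toBytes("host"),
                        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("host.hst.com")));

                List<String> rowKeys = new ArrayList<>();
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        rowKeys.add(Bytes.toString(r.getRow()));
                    }
                }
                System.out.println(rowKeys.size() + " row keys for host.hst.com");
            }
        }
    }

With the filter applied server-side, only matching rows travel to the client, which is usually the main efficiency win over scanning everything and comparing in Java.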

How to change HDFS replication factor for HIVE alone

Submitted by 核能气质少年 on 2019-12-12 05:25:42

Question: Our current HDFS cluster has a replication factor of 1, but to improve performance and reliability (node failure) we want to raise the replication factor of the Hive intermediate files (hive.exec.scratchdir) alone to 5. Is it possible to implement that?

Regards,
Selva

Answer 1: See if -setrep helps you.

    setrep

    Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

    Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under …
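The same change can be made programmatically through the Hadoop FileSystem API. A minimal sketch, assuming the scratch directory is /tmp/hive (the real path is whatever hive.exec.scratchdir points to on your cluster); note it only affects files that already exist, so files Hive writes later still use the client-side dfs.replication:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class RaiseScratchDirReplication {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path scratch = new Path("/tmp/hive");   // assumed value of hive.exec.scratchdir
            short replication = 5;

            // Walk every file under the scratch dir and bump its replication factor,
            // the programmatic equivalent of: hadoop fs -setrep -R 5 /tmp/hive
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(scratch, true);
            while (files.hasNext()) {
                fs.setReplication(files.next().getPath(), replication);
            }
            fs.close();
        }
    }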

Impala GROUP BY partitioned column

Submitted by 拈花ヽ惹草 on 2019-12-12 04:14:16

Question: Theoretical question. Let's say I have a table with four columns: A, B, C, D. The values of A and D are equal, and the table is partitioned by column A. Performance-wise, would it make any difference whether I issue this query

    SELECT SUM(B) GROUP BY A;

or this one:

    SELECT SUM(B) GROUP BY D;

In other words, is there any performance gain from using GROUP BY on the partitioned column? Thanks

Answer 1: Usually there are performance gains if you use the partitioned columns in a filter (WHERE clause in your …

Hue 500 server error

Submitted by 限于喜欢 on 2019-12-12 04:11:51

Question: I am creating a simple MapReduce job. After submitting it, it gives the error below. Please suggest how to fix this issue.

Answer 1: I know I am too late to answer, but I have noticed that this usually gets solved if you clear your cookies.

Source: https://stackoverflow.com/questions/37207387/hue-500-server-error

Not able to recover partitions through alter table in Hive 1.2

Submitted by 无人久伴 on 2019-12-12 03:35:39

Question: I am not able to run

    ALTER TABLE MY_EXTERNAL_TABLE RECOVER PARTITIONS;

on Hive 1.2. However, when I run the alternative

    MSCK REPAIR TABLE MY_EXTERNAL_TABLE;

it just lists the partitions that aren't in the Hive metastore and does not add them. Based on the hive-exec source (org/apache/hadoop/hive/ql/parse/HiveParser.g:1001:1) I can see that there is no token in the grammar matching RECOVER PARTITIONS. Kindly let me know if there's a way to recover all the partitions after …
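RECOVER PARTITIONS is an extension found in Amazon EMR's build of Hive rather than in stock Hive, which would explain the missing grammar token. When MSCK REPAIR only lists the missing partitions, the usual fallback is to register them explicitly with ALTER TABLE ... ADD PARTITION. A sketch over Hive JDBC, with the HiveServer2 URL, credentials, and the dt partition column all assumed for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddMissingPartitions {
        public static void main(String[] args) throws Exception {
            // Assumed HiveServer2 endpoint and credentials -- adjust for your cluster.
            String url = "jdbc:hive2://localhost:10000/default";
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Register each partition MSCK reported as missing.
                // IF NOT EXISTS makes the statement safe to re-run.
                String[] missingDates = {"2016-01-01", "2016-01-02"};   // hypothetical partition values
                for (String dt : missingDates) {
                    stmt.execute("ALTER TABLE MY_EXTERNAL_TABLE ADD IF NOT EXISTS "
                            + "PARTITION (dt='" + dt + "')");
                }
            }
        }
    }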

Yarn application not getting killed even after Application Master is terminated

Submitted by 岁酱吖の on 2019-12-12 03:29:15

Question: My application is suffering because of this issue: even after killing the Application Master, the application is not actually getting killed. It's a known YARN issue, YARN-3561. It occurs out of the blue, so I have developed a fix in my application and I want to test it, but as of now the YARN issue is not replicating again. Is there any sure-shot way of replicating this issue so I can verify my fix?

Answer 1: I was able to replicate this by launching the application as a daemon process by using …

Error message while copy file from LocalFile to hdfs

Submitted by Deadly on 2019-12-12 03:15:20

Question: I tried to copy a file from local to HDFS using the command

    hadoop dfs -copyFromLocal in/ /user/hduser/hadoop

The following error message is shown. Please help to find the problem.

    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    15/02/02 19:22:23 WARN hdfs.DFSClient: DataStreamer Exception
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hduser/hadoop._COPYING_ could only be replicated to 0 nodes instead of …
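"Could only be replicated to 0 nodes" generally means the NameNode sees no live DataNodes (DataNode not started, out of disk space, or unreachable). Running hdfs dfsadmin -report shows the live node count; the same check can be done from Java, as in this sketch (it assumes fs.defaultFS in the classpath configuration points at the cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class LiveDataNodeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            if (!(fs instanceof DistributedFileSystem)) {
                System.err.println("fs.defaultFS does not point at an HDFS cluster: " + fs.getUri());
                return;
            }

            // Lists every DataNode the NameNode currently knows about, with remaining capacity.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            DatanodeInfo[] nodes = dfs.getDataNodeStats();
            System.out.println("Registered DataNodes: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                System.out.println(node.getHostName() + "  remaining=" + node.getRemaining() + " bytes");
            }
        }
    }

If the count is zero, the fix is to get at least one DataNode running and reporting before retrying the copy.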

Apache PIG, ELEPHANTBIRDJSON Loader

Submitted by 陌路散爱 on 2019-12-12 02:14:56

Question: I'm trying to parse the input below (there are 2 records in this input) using the Elephant Bird JSON loader:

    [{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":136.40000000000001,"node_disk_bytes_in_rate_22":187392.0,"node_disk_lnum_7":13}]
    [{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":105.2,"node_disk_bytes_in_rate_22":123084.8,"node_disk_lnum_7":13}]

Here is my syntax:

    register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
    a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig …

Must-have properties for core-site, hdfs-site, mapred-site and yarn-site.xml

Submitted by 半腔热情 on 2019-12-12 02:14:43

Question: Can anyone please let me know the must-have properties for core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml without which Hadoop cannot start?

Answer 1: The settings below are for Hadoop 2.x.x, for a standalone and pseudo-distributed single-node setup.

core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.name …
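The excerpt cuts off before the mapred-site.xml and yarn-site.xml parts of the answer. Independently of what that answer went on to say, a quick way to see which of these keys are actually in effect is to print them from the Configuration that Hadoop loads off the classpath; mapreduce.framework.name and yarn.nodemanager.aux-services are the properties the standard pseudo-distributed setup instructions put in those last two files:

    import org.apache.hadoop.conf.Configuration;

    public class EffectiveHadoopConfig {
        public static void main(String[] args) {
            // new Configuration() picks up core-default.xml and core-site.xml from the classpath;
            // add the other site files explicitly so their values are visible here too.
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");
            conf.addResource("mapred-site.xml");
            conf.addResource("yarn-site.xml");

            String[] keys = {
                "fs.defaultFS",                  // core-site.xml (fs.default.name is its deprecated alias)
                "dfs.replication",               // hdfs-site.xml
                "mapreduce.framework.name",      // mapred-site.xml, usually "yarn"
                "yarn.nodemanager.aux-services"  // yarn-site.xml, usually "mapreduce_shuffle"
            };
            for (String key : keys) {
                System.out.println(key + " = " + conf.get(key));
            }
        }
    }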

Reuse Hadoop code in Spark efficiently?

Submitted by 放肆的年华 on 2019-12-12 01:23:01

Question: Hi, I have code written in Hadoop and now I am trying to migrate it to Spark. The mappers and reducers are fairly complex, so I would like to reuse the Mapper and Reducer classes of the existing Hadoop code inside the Spark program. Can somebody tell me how I can achieve this?

EDIT: So far, I have been able to reuse the mapper class of the standard Hadoop word-count example in Spark, implemented as below.

wordcount.java

    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;
    import org …
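One common way to get this kind of reuse (a sketch, not the poster's actual code) is to pull the per-record logic out of the Hadoop Mapper into a plain static method and call it from both the old Mapper.map() and a Spark flatMapToPair(); driving the original Mapper class unchanged is awkward because its Context has no Spark equivalent. The Spark side then looks roughly like this:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WordCountReuse {

        // Shared per-record logic: the existing Hadoop Mapper.map() and the Spark
        // flatMapToPair() below can both delegate to this one method.
        public static List<Tuple2<String, Integer>> mapRecord(String line) {
            List<Tuple2<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new Tuple2<>(word, 1));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount-reuse");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);

            // flatMapToPair plays the role of the map phase, reduceByKey of the reduce phase.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMapToPair(line -> mapRecord(line).iterator())
                    .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile(args[1]);
            sc.stop();
        }
    }

The extracted method can then be unit-tested on its own, which is often the easiest way to confirm the Spark port behaves like the original Hadoop job.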