hadoop2

What is the replacement for NULLIF in Hive?

Submitted by 自闭症网瘾萝莉.ら on 2019-12-21 18:32:07
Question: I would like to know what the replacement for NULLIF is in Hive. I am using COALESCE, but it is not serving my requirement. My query statement is something like: COALESCE(A, B, C) AS D. COALESCE returns the first NOT NULL value, but my A/B/C columns contain blank values, so COALESCE does not assign the right value to D, since it considers a blank to be NOT NULL. I want the correct value to get assigned to D. In SQL I could have used COALESCE(NULLIF(A,'')......) so it would check for blanks as well. I tried CASE
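One common workaround (a sketch using the column names from the question; my_table is a placeholder) is to map empty strings to NULL yourself before coalescing, e.g. with Hive's if() conditional:

    SELECT COALESCE(
             if(A = '', NULL, A),
             if(B = '', NULL, B),
             if(C = '', NULL, C)
           ) AS D
    FROM my_table;

Newer Hive releases (2.3.0 and later) also ship a built-in NULLIF UDF, in which case the SQL-style COALESCE(NULLIF(A,''), NULLIF(B,''), NULLIF(C,'')) works directly.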

Can I use two field terminators (like ',' and '.') at a time in Hive while creating a table?

Submitted by 被刻印的时光 ゝ on 2019-12-20 03:56:10
Question: I have a file with id and year. My fields are separated by , and . . Is there any chance that, in place of FIELDS TERMINATED BY, I can use both , and . ?

Answer 1: This is possible using RegexSerDe.

    hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
          WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\.(\\S+),(\\d++.\\d++)\\t.*')
          LOCATION '/user/it1/hive/serde/regex';

In the regex above three regex groups are defined. (\\d+
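For reference, that regex expects input lines shaped like id.city,population followed by a tab (a hypothetical example line, not from the original post; <TAB> stands for a literal tab character):

    1.NewYork,8.40<TAB>anything after the tab is ignored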

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout

Submitted by 大城市里の小女人 on 2019-12-19 04:08:13
Question: I am getting the below error with respect to the container while submitting a Spark application to YARN. The Hadoop (2.7.3)/Spark (2.1) environment runs in pseudo-distributed mode on a single-node cluster. The application works perfectly when run in local mode, but it hits a roadblock when I try to verify its correctness in cluster mode with YARN as the RM. I am new to this world, hence looking for help. --- Application logs 2017-04-11 07:13:28 INFO Client:58 - Submitting application 1 to
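When the timeout itself is the immediate blocker, the relevant settings can be raised per submission or in spark-defaults.conf (a sketch; the 600s values and the jar name are illustrative placeholders, and on a single-node setup the root cause is often resource pressure rather than the timeout value):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.rpc.lookupTimeout=600s \
      --conf spark.network.timeout=600s \
      your-application.jar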

Merging small files in Hadoop

Submitted by 假装没事ソ on 2019-12-19 03:08:32
Question: I have a directory (FinalDir) in HDFS into which files (e.g., 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (e.g., 100 MB). But the user is continuously pushing files to FinalDir; it is a continuous process. So for the first pass I need to combine the first 10 files into a large file (e.g., large.txt) and save that file to FinalDir. Now my question is: how will I get the next 10 files, excluding the first 10? Can someone please help me?

Answer 1:
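The answer text is cut off above; independent of it, one common pattern (a sketch; all paths and the file glob are hypothetical, not from the original answer) is to move the current batch into a staging directory first, so files that arrive during the merge are simply left for the next run:

    # Move the current batch aside, merge it, then clean up.
    hdfs dfs -mkdir -p /user/it1/staging
    hdfs dfs -mv '/user/it1/FinalDir/part-*' /user/it1/staging/
    hdfs dfs -cat '/user/it1/staging/*' | hdfs dfs -put - /user/it1/FinalDir/large_$(date +%s).txt
    hdfs dfs -rm -r -skipTrash /user/it1/staging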

Which Hadoop version should I choose among 1.x, 2.2, and 0.23?

Submitted by ≡放荡痞女 on 2019-12-18 13:48:16
Question: Hello, I am new to Hadoop and pretty confused by the version names and which one I should use among 1.x (great support and learning resources), 2.2, or 0.23. I have read that Hadoop is moving to YARN completely from v0.23 (link1). But at the same time it's all over the web that Hadoop v2.0 is moving to YARN (link2), and I can see the YARN configuration files in Hadoop 2.2 itself. But since 0.23 seems to be the latest version to me, does 2.2 also support YARN? (Refer to link 1, it says

Tips to improve MapReduce job performance in Hadoop

Submitted by 时光毁灭记忆、已成空白 on 2019-12-18 07:23:34
Question: I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance? As per my understanding, use of a combiner can improve the performance to a great extent. But what else do we need to configure to improve job performance?

Answer 1: With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper slots and reducer slots in the cluster, etc.), we can't suggest specific tips. But there are some general guidelines to improve the performance. If
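On the combiner point, a minimal sketch of wiring one in with the MapReduce Java API (WordCountMapper and WordCountReducer are hypothetical classes; reusing the reducer as the combiner is only safe when the reduce function is commutative and associative, as a sum is):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(CombinerDriver.class);
            job.setMapperClass(WordCountMapper.class);    // hypothetical mapper
            job.setCombinerClass(WordCountReducer.class); // pre-aggregates map output locally, shrinking shuffle traffic
            job.setReducerClass(WordCountReducer.class);  // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }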

What is memory reserved on YARN?

Submitted by Deadly on 2019-12-18 04:40:44
Question: I managed to launch a Spark application on YARN. However, memory usage is kind of weird, as you can see here: http://imgur.com/1k6VvSI What does "memory reserved" mean? How can I manage to use all the available memory efficiently? Thanks in advance.

Answer 1: Check out this blog from Cloudera that explains the new memory management in YARN. Here's the pertinent bit: ... An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of
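On the "use all available memory" side, the knobs involved live in yarn-site.xml (a sketch; the values are illustrative for an 8 GB node, not recommendations):

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value> <!-- total memory YARN may allocate on this node -->
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value> <!-- container requests are rounded up to a multiple of this -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value> <!-- cap on any single container request -->
    </property>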

Datanode does not start correctly

Submitted by 左心房为你撑大大i on 2019-12-17 21:45:38
Question: I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. When I try to start the datanode service it shows the following error; can anyone please tell me how to resolve this? 2014-03-11 08:48:15,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:9000 starting to offer service 2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-03-11 08:48:15,922
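The actual error line is truncated above, so treat this as a hedged guess rather than a confirmed diagnosis: a frequent cause of a DataNode failing to register after a NameNode reformat is a clusterID mismatch between the two storage directories, which comparing the VERSION files reveals (paths are the pseudo-distributed defaults under hadoop.tmp.dir and may differ on your setup):

    grep clusterID /tmp/hadoop-$USER/dfs/name/current/VERSION
    grep clusterID /tmp/hadoop-$USER/dfs/data/current/VERSION
    # If the two IDs differ, the datanode storage was formatted against an older namenode.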

How can I access S3/S3n from a local Hadoop 2.6 installation?

Submitted by 我是研究僧i on 2019-12-17 06:35:28
Question: I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now, 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster. I have added the AWS credentials in core-site.xml:

    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>some id</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>some id</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
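Beyond credentials, note that in Hadoop 2.6 the s3/s3n filesystem implementations live in the separate hadoop-aws module, which is not on the classpath by default. A sketch of pulling it in (the path assumes a stock 2.6 tarball layout, and the bucket name is a placeholder):

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
    hadoop fs -ls s3n://your-bucket/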

How to submit a MapReduce job to a remote cluster configured with YARN?

Submitted by 大兔子大兔子 on 2019-12-14 02:22:33
Question: I am trying to execute a simple MapReduce program from Eclipse. Following is my program:

    package wordcount;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws
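The program is cut off above; for the remote-submission part of the question specifically, the client's Configuration has to point at the remote cluster before the Job is created (a sketch; the hostnames and ports are placeholders for your cluster, not values from the original post):

    // Point the client at the remote cluster instead of the local runner.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");    // remote HDFS
    conf.set("mapreduce.framework.name", "yarn");             // submit via YARN
    conf.set("yarn.resourcemanager.address", "rm-host:8032"); // remote ResourceManager
    Job job = Job.getInstance(conf, "wordcount");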