hadoop2

What is the replacement for NULLIF in Hive?

Submitted by 自闭症网瘾萝莉.ら on 2019-12-21 18:32:07
Question: I would like to know what the replacement for NULLIF is in Hive. I am using COALESCE, but it is not serving my requirement. My query statement is something like: COALESCE(A, B, C) AS D. COALESCE returns the first NOT NULL value, but my A/B/C columns contain blank values, so COALESCE does not assign the right value to D, since it considers a blank to be NOT NULL. I want the correct value to get assigned to D. In SQL I could have used COALESCE(NULLIF(A,'')......) so it would check for blanks as well. I tried CASE
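One common workaround (a sketch using the column names from the question; my_table is a placeholder) is to map empty strings to NULL yourself before coalescing, e.g. with Hive's if() conditional:

    SELECT COALESCE(
             if(A = '', NULL, A),
             if(B = '', NULL, B),
             if(C = '', NULL, C)
           ) AS D
    FROM my_table;

Newer Hive releases (2.3.0 and later) also ship a built-in NULLIF UDF, in which case the SQL-style COALESCE(NULLIF(A,''), NULLIF(B,''), NULLIF(C,'')) works directly.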

Can I use two field terminators (like ',' and '.') at a time in Hive while creating a table?

Submitted by 被刻印的时光 ゝ on 2019-12-20 03:56:10
Question: I have a file with id and year. My fields are separated by , and . . Is there any chance that, in place of FIELDS TERMINATED BY, I can use both , and . ?

Answer 1: This is possible using RegexSerDe.

    hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
          WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\.(\\S+),(\\d++.\\d++)\\t.*')
          LOCATION '/user/it1/hive/serde/regex';

In the regex above three regex groups are defined. (\\d+
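For reference, that regex expects input lines shaped like id.city,population followed by a tab (a hypothetical example line, not from the original post; <TAB> stands for a literal tab character):

    1.NewYork,8.40<TAB>anything after the tab is ignored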

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout

Submitted by 大城市里の小女人 on 2019-12-19 04:08:13
Question: I am getting the below error with respect to the container while submitting a Spark application to YARN. The Hadoop (2.7.3)/Spark (2.1) environment runs in pseudo-distributed mode on a single-node cluster. The application works perfectly when run in local mode, but it hits a roadblock when I try to verify its correctness in cluster mode with YARN as the RM. I am new to this world, hence looking for help. --- Application logs 2017-04-11 07:13:28 INFO Client:58 - Submitting application 1 to
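When the timeout itself is the immediate blocker, the relevant settings can be raised per submission or in spark-defaults.conf (a sketch; the 600s values and the jar name are illustrative placeholders, and on a single-node setup the root cause is often resource pressure rather than the timeout value):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.rpc.lookupTimeout=600s \
      --conf spark.network.timeout=600s \
      your-application.jar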

Merging small files in Hadoop

Submitted by 假装没事ソ on 2019-12-19 03:08:32
Question: I have a directory (FinalDir) in HDFS into which files (e.g., 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (e.g., 100 MB). But the user is continuously pushing files to FinalDir; it is a continuous process. So for the first pass I need to combine the first 10 files into a large file (e.g., large.txt) and save that file to FinalDir. Now my question is: how will I get the next 10 files, excluding the first 10? Can someone please help me?

Answer 1:
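The answer text is cut off above; independent of it, one common pattern (a sketch; all paths and the file glob are hypothetical, not from the original answer) is to move the current batch into a staging directory first, so files that arrive during the merge are simply left for the next run:

    # Move the current batch aside, merge it, then clean up.
    hdfs dfs -mkdir -p /user/it1/staging
    hdfs dfs -mv '/user/it1/FinalDir/part-*' /user/it1/staging/
    hdfs dfs -cat '/user/it1/staging/*' | hdfs dfs -put - /user/it1/FinalDir/large_$(date +%s).txt
    hdfs dfs -rm -r -skipTrash /user/it1/staging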

Which Hadoop version should I choose among 1.x, 2.2, and 0.23?

Submitted by ≡放荡痞女 on 2019-12-18 13:48:16
Question: Hello, I am new to Hadoop and pretty confused by the version names and which one I should use among 1.x (great support and learning resources), 2.2, or 0.23. I have read that Hadoop is moving to YARN completely from v0.23 (link1). But at the same time it's all over the web that Hadoop v2.0 is moving to YARN (link2), and I can see the YARN configuration files in Hadoop 2.2 itself. But since 0.23 seems to be the latest version to me, does 2.2 also support YARN? (Refer to link 1, it says

Tips to improve MapReduce job performance in Hadoop

Submitted by 时光毁灭记忆、已成空白 on 2019-12-18 07:23:34
Question: I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance? As per my understanding, use of a combiner can improve the performance to a great extent. But what else do we need to configure to improve job performance?

Answer 1: With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper slots and reducer slots in the cluster, etc.), we can't suggest specific tips. But there are some general guidelines to improve the performance. If
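On the combiner point, a minimal sketch of wiring one in with the MapReduce Java API (WordCountMapper and WordCountReducer are hypothetical classes; reusing the reducer as the combiner is only safe when the reduce function is commutative and associative, as a sum is):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(CombinerDriver.class);
            job.setMapperClass(WordCountMapper.class);    // hypothetical mapper
            job.setCombinerClass(WordCountReducer.class); // pre-aggregates map output locally, shrinking shuffle traffic
            job.setReducerClass(WordCountReducer.class);  // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }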

What is memory reserved on YARN?

Submitted by Deadly on 2019-12-18 04:40:44
Question: I managed to launch a Spark application on YARN. However, memory usage is kind of weird, as you can see here: http://imgur.com/1k6VvSI What does "memory reserved" mean? How can I manage to use all the available memory efficiently? Thanks in advance.

Answer 1: Check out this blog from Cloudera that explains the new memory management in YARN. Here's the pertinent bit: ... An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of
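On the "use all available memory" side, the knobs involved live in yarn-site.xml (a sketch; the values are illustrative for an 8 GB node, not recommendations):

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value> <!-- total memory YARN may allocate on this node -->
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value> <!-- container requests are rounded up to a multiple of this -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value> <!-- cap on any single container request -->
    </property>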

Datanode does not start correctly

Submitted by 左心房为你撑大大i on 2019-12-17 21:45:38
Question: I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. When I try to start the datanode service it shows the following error; can anyone please tell me how to resolve this? 2014-03-11 08:48:15,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:9000 starting to offer service 2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-03-11 08:48:15,922
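The actual error line is truncated above, so treat this as a hedged guess rather than a confirmed diagnosis: a frequent cause of a DataNode failing to register after a NameNode reformat is a clusterID mismatch between the two storage directories, which comparing the VERSION files reveals (paths are the pseudo-distributed defaults under hadoop.tmp.dir and may differ on your setup):

    grep clusterID /tmp/hadoop-$USER/dfs/name/current/VERSION
    grep clusterID /tmp/hadoop-$USER/dfs/data/current/VERSION
    # If the two IDs differ, the datanode storage was formatted against an older namenode.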

How can I access S3/S3n from a local Hadoop 2.6 installation?

Submitted by 我是研究僧i on 2019-12-17 06:35:28
Question: I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now, 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster. I have added the AWS credentials in core-site.xml:

    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>some id</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>some id</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
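Beyond credentials, note that in Hadoop 2.6 the s3/s3n filesystem implementations live in the separate hadoop-aws module, which is not on the classpath by default. A sketch of pulling it in (the path assumes a stock 2.6 tarball layout, and the bucket name is a placeholder):

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
    hadoop fs -ls s3n://your-bucket/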

How to submit a MapReduce job to a remote cluster configured with YARN?

Submitted by 大兔子大兔子 on 2019-12-14 02:22:33
Question: I am trying to execute a simple MapReduce program from Eclipse. Following is my program:

    package wordcount;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws
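The program is cut off above; for the remote-submission part of the question specifically, the client's Configuration has to point at the remote cluster before the Job is created (a sketch; the hostnames and ports are placeholders for your cluster, not values from the original post):

    // Point the client at the remote cluster instead of the local runner.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");    // remote HDFS
    conf.set("mapreduce.framework.name", "yarn");             // submit via YARN
    conf.set("yarn.resourcemanager.address", "rm-host:8032"); // remote ResourceManager
    Job job = Job.getInstance(conf, "wordcount");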