hadoop2

How to find the installation mode of Hadoop 2.x

断了今生、忘了曾经 submitted on 2019-12-24 13:11:51
Question: What is the quickest way of finding the installation mode of Hadoop 2.x? I just want to learn the best way to find the mode when I first log in to a machine with Hadoop installed.

Answer 1: In Hadoop 2, go to the /etc/hadoop/conf folder and check the fs.defaultFS property in core-site.xml and the yarn.resourcemanager.hostname property in yarn-site.xml. The values of those properties decide which mode you are running in.

fs.defaultFS:
standalone mode - file:///
pseudo-distributed - hdfs://localhost:8020/
fully
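A minimal shell sketch of that check (assuming the configs live under /etc/hadoop/conf, which varies by distribution):

    # file:/// in fs.defaultFS indicates standalone mode;
    # hdfs://localhost:... usually indicates pseudo-distributed
    grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml
    grep -A1 'yarn.resourcemanager.hostname' /etc/hadoop/conf/yarn-site.xml
    # Or ask the client directly, without hunting for the file:
    hdfs getconf -confKey fs.defaultFS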

spark-submit: how to set user.name

半世苍凉 submitted on 2019-12-23 12:28:20
Question: I want to set mapreduce.job.user.name=myuser. Tried:

spark-submit --class com.MyClass \
  --conf mapreduce.job.user.name=myuser \
  --conf spark.mapreduce.job.user.name=myuser \
  --master yarn \
  --deploy-mode cluster \

Also tried --conf user.name. The Environment tab of the Spark UI still shows user.name as yarn.

Answer 1: Set it as a runtime environment variable. Try:

--conf spark.executorEnv.mapreduce.job.user.name=myuser

spark.executorEnv.[EnvironmentVariableName] - Add the environment variable specified by
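A hedged sketch of the full invocation with that setting applied (the application jar name is a hypothetical placeholder; com.MyClass comes from the question):

    spark-submit --class com.MyClass \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executorEnv.mapreduce.job.user.name=myuser \
      myapp.jar

Note that spark.executorEnv.* only sets an environment variable inside the executor processes; whether the job's effective user actually changes depends on how the cluster authenticates, so treat this as one option to test rather than a guaranteed fix.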

Hadoop 2.7, Spark, Hive, JasperReports, Sqoop - Architecture

…衆ロ難τιáo~ submitted on 2019-12-23 05:07:23
Question: First of all, this is not a question asking for help deploying the components below step by step. What I'm asking for is advice on how the architecture should be designed. What I'm planning to do is develop a reporting platform using existing data. The following is what I have gathered by researching. I have an existing RDBMS which has a large number of records, so I'm using:

Sqoop - extract data from the RDBMS into Hadoop (sketched below)
Hadoop - storage platform
Hive - data warehouse
Spark - since Hive is more like batch
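A hedged sketch of the Sqoop extraction step named in the list above (the JDBC URL, credentials, table, and target directory are all hypothetical placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/salesdb \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4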

Hadoop multinode cluster too slow. How do I increase the speed of data processing?

旧城冷巷雨未停 submitted on 2019-12-23 04:53:40
Question: I have a 6-node cluster - 5 DNs and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDDs; the NN has a 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, yarn-site.xml. After running an MR job, I checked my RAM usage, shown below.

Namenode: free -g
          total  used  free  shared  buff/cache  available
Mem:         31     7    15       0           8         22
Swap:        31     0    31

Datanode, Slave1: free -g
          total  used  free  shared  buff/cache  available
Mem:         31     6     6       0          18         24
Swap:        31     3    28

Slave2:
          total  used  free  shared  buff/cache
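A hedged set of commands for getting a cluster-wide view before tuning (assumes a working HDFS/YARN client on the node):

    # Capacity and usage reported per datanode
    hdfs dfsadmin -report
    # NodeManager status, including memory available to containers
    yarn node -list -all
    # Host-level memory snapshot, as used in the question
    free -g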

HDFS federation

牧云@^-^@ submitted on 2019-12-23 02:50:48
Question: I have a few basic questions regarding HDFS federation. Is it possible to read a file created via one NameNode from another NameNode in the federated cluster? Does the current version of Hadoop support this feature?

Answer 1: Let me explain how NameNode federation works, as described on the Apache website. NameNode: in order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces. The NameNodes are federated; the NameNodes are independent and do not require
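A hedged illustration of how a client addresses two federated namespaces (the NameNode hostnames are hypothetical):

    # Each NameNode owns its own namespace; a client selects one by URI
    hdfs dfs -ls hdfs://nn1.example.com:8020/data
    hdfs dfs -ls hdfs://nn2.example.com:8020/logs
    # A ViewFs client-side mount table can present both as a single tree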

How is virtual memory calculated in Spark?

旧城冷巷雨未停 submitted on 2019-12-23 02:42:32
Question: I am using Spark on Hadoop and want to know how Spark allocates virtual memory to an executor. Per the YARN vmem-pmem ratio, YARN gives a container 2.1 times its physical memory as virtual memory. Hence, if -Xmx is 1 GB, then 1 GB * 2.1 = 2.1 GB is allocated to the container. How does it work with Spark? And is the statement below correct? If I give executor memory = 1 GB, then total virtual memory = 1 GB * 2.1 * spark.yarn.executor.memoryOverhead. Is this true? If not, then how is virtual memory for an executor
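A hedged worked example of the calculation as it is commonly understood (assumes the default yarn.nodemanager.vmem-pmem-ratio of 2.1 and the default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory); a sketch of the usual reading, not an authoritative answer):

    # spark.executor.memory = 1024 MB
    # overhead  = max(384, 0.10 * 1024)  = 384 MB
    # container = 1024 + 384             = 1408 MB physical
    # vmem cap  = 1408 * 2.1             = 2956.8 MB virtual
    echo $(( (1024 + 384) * 21 / 10 ))   # 2956 (MB, integer arithmetic)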

Apache Pig: getting the max count in each group

本秂侑毒 submitted on 2019-12-23 01:47:28
Question: I have data in Pig in the format {(group, productId, count)}. Now I want to get the maximum count in each group, and the output might look like {(group, productId, maxCount)}. Here is the sample input data:

(south America, prod1, 45), (south America, prod2, 36), (latin america, prod1, 48), (latin america, prod5, 35)

The output for this input looks like:

(south america, prod1, 45)
(North America, prod2, 36)
(latin america, prod1, 48)

Can someone help me with this?

Answer 1: Based on your sample input
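A hedged sketch of the standard Pig idiom for per-group maxima (ORDER and LIMIT inside a nested FOREACH); the input path and schema are hypothetical, and grp is used as an alias because group is a reserved word in Pig:

    pig -e "
      data    = LOAD '/data/counts' USING PigStorage(',')
                AS (grp:chararray, productId:chararray, cnt:int);
      grouped = GROUP data BY grp;
      maxima  = FOREACH grouped {
                  sorted = ORDER data BY cnt DESC;
                  top1   = LIMIT sorted 1;
                  GENERATE FLATTEN(top1);
                };
      DUMP maxima;
    "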

Confusion over Hadoop namenode memory usage

落爺英雄遲暮 submitted on 2019-12-22 11:46:25
Question: I have a silly doubt about the Hadoop NameNode memory calculation. It is mentioned in the Hadoop book (the Definitive Guide) that "Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is
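The book's arithmetic, spelled out: each file contributes one file object plus one block object, at roughly 150 bytes apiece, hence 300 MB for a million single-block files. A quick check:

    # 1,000,000 files * (1 file object + 1 block object) * 150 bytes
    echo $(( 1000000 * 2 * 150 ))             # 300000000 bytes
    echo $(( 1000000 * 2 * 150 / 1000000 ))   # 300 MB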

Hadoop client and cluster separation

社会主义新天地 submitted on 2019-12-21 23:18:23
Question: I am a newbie in Hadoop, and in Linux as well. My professor asked us to separate the Hadoop client and the cluster using port mapping or a VPN. I don't understand the point of such a separation. Can anybody give me a hint?

Now I get the idea of client/cluster separation. I think Hadoop must also be installed on the client machine, and when the client submits a Hadoop job, it is submitted to the masters of the cluster. I have some naive ideas (see the sketch below):

1. Create a client machine and install Hadoop.
2. Set
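A hedged sketch of what a client-only machine typically ends up doing once its configs point at the cluster masters (the hostname and jar are hypothetical; this illustrates the separation, not the professor's exact assignment):

    # The client carries Hadoop binaries plus configs naming the cluster
    # masters rather than localhost; verify where submissions will go:
    hdfs getconf -confKey fs.defaultFS    # e.g. hdfs://master1.example.com:8020
    # Jobs submitted here execute on the cluster, not on the client:
    hadoop jar my-job.jar MyJobClass /input /output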

Hadoop: reading ORC files and putting them into an RDBMS?

戏子无情 submitted on 2019-12-21 20:05:10
Question: I have a Hive table which is stored in ORC file format. I want to export the data to a Teradata database. I researched Sqoop but could not find a way to export ORC files. Is there a way to make Sqoop work with ORC, or is there any other tool that I could use to export the data? Thanks.

Answer 1: You can use HCatalog:

sqoop export --connect "jdbc:sqlserver://xxxx:1433;databaseName=xxx;USERNAME=xxx;PASSWORD=xxx" \
  --table rdmsTableName \
  --hcatalog-database hiveDB \
  --hcatalog-table hiveTableName
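Since the target is Teradata rather than SQL Server, a hedged variant of the same HCatalog approach with a Teradata JDBC URL (hostname, database, and table names are hypothetical; assumes the Teradata JDBC driver is on Sqoop's classpath):

    sqoop export \
      --connect jdbc:teradata://td-host/DATABASE=mydb \
      --username myuser -P \
      --table TARGET_TABLE \
      --hcatalog-database hiveDB \
      --hcatalog-table hiveTableName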