hadoop2

How to find the installation mode of Hadoop 2.x

断了今生、忘了曾经 submitted on 2019-12-24 13:11:51
Question: What is the quickest way of finding the installation mode of Hadoop 2.x? I just want to learn the best way to find the mode when I first log in to a machine with Hadoop installed.

Answer 1: In Hadoop 2, go to the /etc/hadoop/conf folder and check the fs.defaultFS property in core-site.xml and the yarn.resourcemanager.hostname property in yarn-site.xml. The values of those properties decide which mode you are running in.

fs.defaultFS:
standalone mode - file:///
pseudo-distributed - hdfs://localhost:8020/
fully
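A minimal shell sketch of that check (assuming the configs live under /etc/hadoop/conf, which varies by distribution):

    # file:/// in fs.defaultFS indicates standalone mode;
    # hdfs://localhost:... usually indicates pseudo-distributed
    grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml
    grep -A1 'yarn.resourcemanager.hostname' /etc/hadoop/conf/yarn-site.xml
    # Or ask the client directly, without hunting for the file:
    hdfs getconf -confKey fs.defaultFS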

spark-submit: how to set user.name

半世苍凉 submitted on 2019-12-23 12:28:20
Question: I want to set mapreduce.job.user.name=myuser. Tried:

spark-submit --class com.MyClass \
  --conf mapreduce.job.user.name=myuser \
  --conf spark.mapreduce.job.user.name=myuser \
  --master yarn \
  --deploy-mode cluster \

Also tried --conf user.name. The Environment tab of the Spark UI still shows user.name as yarn.

Answer 1: Set it as a runtime environment variable. Try:

--conf spark.executorEnv.mapreduce.job.user.name=myuser

spark.executorEnv.[EnvironmentVariableName] - Add the environment variable specified by
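A hedged sketch of the full invocation with that setting applied (the application jar name is a hypothetical placeholder; com.MyClass comes from the question):

    spark-submit --class com.MyClass \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executorEnv.mapreduce.job.user.name=myuser \
      myapp.jar

Note that spark.executorEnv.* only sets an environment variable inside the executor processes; whether the job's effective user actually changes depends on how the cluster authenticates, so treat this as one option to test rather than a guaranteed fix.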

Hadoop 2.7, Spark, Hive, JasperReports, Sqoop - Architecture

…衆ロ難τιáo~ submitted on 2019-12-23 05:07:23
Question: First of all, this is not a question asking for help deploying the components below step by step. What I'm asking for is advice on how the architecture should be designed. What I'm planning to do is develop a reporting platform using existing data. The following is what I have gathered by researching. I have an existing RDBMS which has a large number of records, so I'm using:

Sqoop - extract data from the RDBMS into Hadoop (sketched below)
Hadoop - storage platform
Hive - data warehouse
Spark - since Hive is more like batch
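A hedged sketch of the Sqoop extraction step named in the list above (the JDBC URL, credentials, table, and target directory are all hypothetical placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/salesdb \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4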

Hadoop multinode cluster too slow. How do I increase the speed of data processing?

旧城冷巷雨未停 submitted on 2019-12-23 04:53:40
Question: I have a 6-node cluster - 5 DNs and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDDs; the NN has a 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, yarn-site.xml. After running an MR job, I checked my RAM usage, shown below.

Namenode: free -g
          total  used  free  shared  buff/cache  available
Mem:         31     7    15       0           8         22
Swap:        31     0    31

Datanode, Slave1: free -g
          total  used  free  shared  buff/cache  available
Mem:         31     6     6       0          18         24
Swap:        31     3    28

Slave2:
          total  used  free  shared  buff/cache
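A hedged set of commands for getting a cluster-wide view before tuning (assumes a working HDFS/YARN client on the node):

    # Capacity and usage reported per datanode
    hdfs dfsadmin -report
    # NodeManager status, including memory available to containers
    yarn node -list -all
    # Host-level memory snapshot, as used in the question
    free -g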

HDFS federation

牧云@^-^@ submitted on 2019-12-23 02:50:48
Question: I have a few basic questions regarding HDFS federation. Is it possible to read a file created via one NameNode from another NameNode in the federated cluster? Does the current version of Hadoop support this feature?

Answer 1: Let me explain how NameNode federation works, as described on the Apache website. NameNode: in order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces. The NameNodes are federated; the NameNodes are independent and do not require
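A hedged illustration of how a client addresses two federated namespaces (the NameNode hostnames are hypothetical):

    # Each NameNode owns its own namespace; a client selects one by URI
    hdfs dfs -ls hdfs://nn1.example.com:8020/data
    hdfs dfs -ls hdfs://nn2.example.com:8020/logs
    # A ViewFs client-side mount table can present both as a single tree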

How is virtual memory calculated in Spark?

旧城冷巷雨未停 submitted on 2019-12-23 02:42:32
Question: I am using Spark on Hadoop and want to know how Spark allocates virtual memory to an executor. Per the YARN vmem-pmem ratio, YARN gives a container 2.1 times its physical memory as virtual memory. Hence, if -Xmx is 1 GB, then 1 GB * 2.1 = 2.1 GB is allocated to the container. How does it work with Spark? And is the statement below correct? If I give executor memory = 1 GB, then total virtual memory = 1 GB * 2.1 * spark.yarn.executor.memoryOverhead. Is this true? If not, then how is virtual memory for an executor
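A hedged worked example of the calculation as it is commonly understood (assumes the default yarn.nodemanager.vmem-pmem-ratio of 2.1 and the default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory); a sketch of the usual reading, not an authoritative answer):

    # spark.executor.memory = 1024 MB
    # overhead  = max(384, 0.10 * 1024)  = 384 MB
    # container = 1024 + 384             = 1408 MB physical
    # vmem cap  = 1408 * 2.1             = 2956.8 MB virtual
    echo $(( (1024 + 384) * 21 / 10 ))   # 2956 (MB, integer arithmetic)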

Apache Pig: getting the max count in each group

本秂侑毒 submitted on 2019-12-23 01:47:28
Question: I have data in Pig in the format {(group, productId, count)}. Now I want to get the maximum count in each group, and the output might look like {(group, productId, maxCount)}. Here is the sample input data:

(south America, prod1, 45), (south America, prod2, 36), (latin america, prod1, 48), (latin america, prod5, 35)

The output for this input looks like:

(south america, prod1, 45)
(North America, prod2, 36)
(latin america, prod1, 48)

Can someone help me with this?

Answer 1: Based on your sample input
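A hedged sketch of the standard Pig idiom for per-group maxima (ORDER and LIMIT inside a nested FOREACH); the input path and schema are hypothetical, and grp is used as an alias because group is a reserved word in Pig:

    pig -e "
      data    = LOAD '/data/counts' USING PigStorage(',')
                AS (grp:chararray, productId:chararray, cnt:int);
      grouped = GROUP data BY grp;
      maxima  = FOREACH grouped {
                  sorted = ORDER data BY cnt DESC;
                  top1   = LIMIT sorted 1;
                  GENERATE FLATTEN(top1);
                };
      DUMP maxima;
    "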

Confusion over Hadoop namenode memory usage

落爺英雄遲暮 submitted on 2019-12-22 11:46:25
Question: I have a silly doubt about the Hadoop NameNode memory calculation. It is mentioned in the Hadoop book (the Definitive Guide) that "Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is
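The book's arithmetic, spelled out: each file contributes one file object plus one block object, at roughly 150 bytes apiece, hence 300 MB for a million single-block files. A quick check:

    # 1,000,000 files * (1 file object + 1 block object) * 150 bytes
    echo $(( 1000000 * 2 * 150 ))             # 300000000 bytes
    echo $(( 1000000 * 2 * 150 / 1000000 ))   # 300 MB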

Hadoop client and cluster separation

社会主义新天地 submitted on 2019-12-21 23:18:23
Question: I am a newbie in Hadoop, and in Linux as well. My professor asked us to separate the Hadoop client and the cluster using port mapping or a VPN. I don't understand the point of such a separation. Can anybody give me a hint?

Now I get the idea of client/cluster separation. I think Hadoop must also be installed on the client machine, and when the client submits a Hadoop job, it is submitted to the masters of the cluster. I have some naive ideas (see the sketch below):

1. Create a client machine and install Hadoop.
2. Set
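A hedged sketch of what a client-only machine typically ends up doing once its configs point at the cluster masters (the hostname and jar are hypothetical; this illustrates the separation, not the professor's exact assignment):

    # The client carries Hadoop binaries plus configs naming the cluster
    # masters rather than localhost; verify where submissions will go:
    hdfs getconf -confKey fs.defaultFS    # e.g. hdfs://master1.example.com:8020
    # Jobs submitted here execute on the cluster, not on the client:
    hadoop jar my-job.jar MyJobClass /input /output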

Hadoop: reading ORC files and putting them into an RDBMS?

戏子无情 submitted on 2019-12-21 20:05:10
Question: I have a Hive table which is stored in ORC file format. I want to export the data to a Teradata database. I researched Sqoop but could not find a way to export ORC files. Is there a way to make Sqoop work with ORC, or is there any other tool that I could use to export the data? Thanks.

Answer 1: You can use HCatalog:

sqoop export --connect "jdbc:sqlserver://xxxx:1433;databaseName=xxx;USERNAME=xxx;PASSWORD=xxx" \
  --table rdmsTableName \
  --hcatalog-database hiveDB \
  --hcatalog-table hiveTableName
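Since the target is Teradata rather than SQL Server, a hedged variant of the same HCatalog approach with a Teradata JDBC URL (hostname, database, and table names are hypothetical; assumes the Teradata JDBC driver is on Sqoop's classpath):

    sqoop export \
      --connect jdbc:teradata://td-host/DATABASE=mydb \
      --username myuser -P \
      --table TARGET_TABLE \
      --hcatalog-database hiveDB \
      --hcatalog-table hiveTableName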