hadoop2

Hadoop: how to start my first project

橙三吉。 submitted on 2019-12-11 04:38:54
Question: I'm starting to work with Hadoop, but I don't know where or how to begin. I'm working on OS X and I followed a tutorial to install Hadoop; the installation is done and it works, but now I don't know what to do. Is there an IDE to install (maybe Eclipse)? I found some code, but nothing works and I don't know what I have to add to my project, etc. Can you give me some information or point me to a complete tutorial? Answer 1: If you want to learn Hadoop framework then i recomend to just start with installing
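As a concrete starting point (this is a generic sketch, not the truncated answer above), a first Hadoop project is usually a word-count MapReduce job built against the hadoop-client libraries and run from an IDE such as Eclipse. All class names here are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The only build dependency this needs is hadoop-client, matching the installed Hadoop version, added through Maven or Gradle inside the IDE.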

Hadoop release versions are confusing

我与影子孤独终老i submitted on 2019-12-11 03:16:21
Question: I am trying to figure out the different versions of Hadoop, and I got confused after reading this page:

Download
1.2.X - current stable version, 1.2 release
2.2.X - current stable 2.x version
2.3.X - current 2.x version
0.23.X - similar to 2.X.X but missing NN HA.
Releases may be downloaded from Apache mirrors.

Question: I think any release starting with 0.xx is an alpha version and should not be used in production; is that the case? What is the difference between 0.23.X and 2.3.X? it

Hadoop installation issue

て烟熏妆下的殇ゞ submitted on 2019-12-11 02:37:17
Question: I followed this tutorial for the installation of Hadoop. Unfortunately, when I run the start-all.sh script, the following error was printed to the console:

hduser@dennis-HP:/usr/local/hadoop/sbin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
hadoop config script is run...
hdfs script is run...
Config parameter :
16/04/10 23:45:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Convert a text file to sequence format in Spark Java

旧巷老猫 submitted on 2019-12-11 01:09:02
Question: In Spark Java, how do I convert a text file to a sequence file? The following is my code:

SparkConf sparkConf = new SparkConf().setAppName("txt2seq");
sparkConf.setMaster("local").set("spark.executor.memory", "1g");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class,
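A minimal sketch of one common fix (the accepted answer is truncated above, so this is an assumption): Hadoop sequence files expect Writable key/value types, so the (String, String) pairs are converted to Text before calling saveAsNewAPIHadoopFile with SequenceFileOutputFormat. Paths and class names are illustrative.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TextToSequenceFile {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("txt2seq").setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // Read whole text files as (path, content) pairs of plain Strings.
        JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");

        // Sequence files need Writable types, so convert String -> Text first.
        JavaPairRDD<Text, Text> writable = infile.mapToPair(
                kv -> new Tuple2<>(new Text(kv._1()), new Text(kv._2())));

        // Write out as a Hadoop sequence file using the new-API output format.
        writable.saveAsNewAPIHadoopFile(
                "outfile.seq", Text.class, Text.class, SequenceFileOutputFormat.class);

        ctx.stop();
    }
}
```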

How to register InternalRow with Kryo in Spark

心已入冬 submitted on 2019-12-11 00:59:38
Question: I want to run Spark with Kryo serialisation, so I set spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.kryo.registrationRequired=true. When I then run my code I get the error:

Class is not registered: org.apache.spark.sql.catalyst.InternalRow[]

According to this post I used

sc.getConf.registerKryoClasses(Array( classOf[ org.apache.spark.sql.catalyst.InternalRow[_] ] ))

But then the error is:

org.apache.spark.sql.catalyst.InternalRow does not take type parameters

Answer 1:
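A sketch of the usual workaround, assuming the error refers to the array class rather than to a parameterised InternalRow: register the array type itself (in Scala that would be classOf[Array[InternalRow]]). The Java equivalent is shown below; this is an illustration, not the truncated accepted answer.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.catalyst.InternalRow;

public class KryoRegistration {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-internalrow")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrationRequired", "true");

        // "InternalRow[]" in the error message is the *array* class, so register
        // the array type directly instead of trying to parameterise InternalRow.
        conf.registerKryoClasses(new Class<?>[] { InternalRow[].class });
    }
}
```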

How are the numbers of concurrent mappers and reducers calculated in Hadoop 2 + YARN?

ε祈祈猫儿з submitted on 2019-12-10 23:15:54
Question: I've searched for some time and I've found that a MapReduce cluster using hadoop2 + yarn has the following number of concurrent maps and reduces per node:

Concurrent Maps # = yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Concurrent Reduces # = yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb

However, I've set up a cluster with 10 machines, with these configurations:

'yarn_site' => {
  'yarn.nodemanager.resource.cpu-vcores' => '32',
  'yarn.nodemanager.resource.memory
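A small sketch of the arithmetic described by those two formulas, with illustrative values for the memory settings (the actual cluster configuration is truncated in the question):

```java
public class ConcurrentContainers {
    public static void main(String[] args) {
        // Illustrative values; substitute the real yarn-site.xml / mapred-site.xml settings.
        long nodeMemoryMb   = 57344; // yarn.nodemanager.resource.memory-mb
        long mapMemoryMb    = 2048;  // mapreduce.map.memory.mb
        long reduceMemoryMb = 4096;  // mapreduce.reduce.memory.mb

        long concurrentMaps    = nodeMemoryMb / mapMemoryMb;    // 28 per node
        long concurrentReduces = nodeMemoryMb / reduceMemoryMb; // 14 per node

        int nodes = 10; // the question mentions a 10-machine cluster
        System.out.println("Concurrent maps per cluster:    " + concurrentMaps * nodes);
        System.out.println("Concurrent reduces per cluster: " + concurrentReduces * nodes);
    }
}
```

Note that with the default capacity scheduler only memory is considered when sizing containers; yarn.nodemanager.resource.cpu-vcores becomes a limiting factor only when the DominantResourceCalculator is enabled.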

spark2 + yarn - NullPointerException while preparing AM container

血红的双手。 submitted on 2019-12-10 21:07:16
Question: I'm trying to run pyspark --master yarn. Spark version: 2.0.0. Hadoop version: 2.7.2. The Hadoop YARN web interface started successfully. This is what happens:

16/08/15 10:00:12 DEBUG Client: Using the default MR application classpath: $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
16/08/15 10:00:12 INFO Client: Preparing resources for our AM container
16/08/15 10:00:12 DEBUG Client:
16/08/15 10:00:12 DEBUG DFSClient: /user/mispp/.sparkStaging

Kinesis Stream with Empty Records in Google Dataproc with Spark 1.6.1 Hadoop 2.7.2

陌路散爱 submitted on 2019-12-10 20:14:38
Question: I am trying to connect to an Amazon Kinesis Stream from Google Dataproc but am only getting empty RDDs.

Command:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX

Detailed log: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c

More details:
Spark 1.6.1
Hadoop 2.7.2
Assembly used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar

Surprisingly, that works

Spark on Yarn Container Failure

萝らか妹 submitted on 2019-12-10 18:05:13
Question: For reference, I solved this issue by adding Netty 4.1.17 to hadoop/share/hadoop/common. No matter what jar I try to run (including the example from https://spark.apache.org/docs/latest/running-on-yarn.html), I keep getting an error about container failure when running Spark on YARN. I get this error in the command prompt:

Diagnostics: Exception from container-launch.
Container id: container_1530118456145_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: at org.apache

Spark checkpointing error when joining static dataset with DStream

馋奶兔 submitted on 2019-12-10 13:22:06
Question: I am trying to build a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop directory using textFileStream() at an interval of 1 minute. I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from the Hadoop directory. The problem comes when I enable checkpointing. With an empty checkpoint
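A minimal sketch of the pipeline described above (it illustrates the stream/static join, not a fix for the truncated checkpoint error; paths and the comma-split parsing are hypothetical): the static RDD is read once with textFile(), the stream is aggregated with reduceByKey(), and the join runs inside transformToPair() on each micro-batch.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamStaticJoin {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stream-static-join");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));
        jssc.checkpoint("hdfs:///checkpoints/stream-static-join"); // hypothetical path

        // Static dataset: (key, value2) pairs read once from HDFS.
        JavaPairRDD<String, String> staticRdd = jssc.sparkContext()
                .textFile("hdfs:///data/static")
                .mapToPair(line -> {
                    String[] parts = line.split(",", 2); // placeholder parsing
                    return new Tuple2<>(parts[0], parts[1]);
                });

        // Streaming feed: new files appearing in the monitored directory every minute.
        JavaDStream<String> lines = jssc.textFileStream("hdfs:///data/incoming");

        // Aggregation (group by key) on each incoming micro-batch.
        JavaPairDStream<String, Long> aggregated = lines
                .mapToPair(line -> new Tuple2<>(line.split(",", 2)[0], 1L))
                .reduceByKey(Long::sum);

        // Join each aggregated micro-batch with the static RDD.
        JavaPairDStream<String, Tuple2<Long, String>> joined =
                aggregated.transformToPair(rdd -> rdd.join(staticRdd));

        joined.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```

One frequently cited pitfall with this pattern is that checkpoint recovery serialises the DStream operation graph but cannot restore an RDD that was created outside it, so a restart from the checkpoint can fail on the reference to the static RDD.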