hadoop-streaming

IntelliJ connecting to Hortonworks Spark remotely fails

不问归期 submitted on 2019-12-25 07:26:54
Question: I have a Hortonworks sandbox 2.4 with Spark 1.6 set up. I then created an IntelliJ Spark development environment on Windows using the HDP Spark jar and Scala 2.10.5, so both the Spark and Scala versions match between my Windows and HDP environments, as indicated here. My IntelliJ dev environment works with local as the master. Then I try to connect to HDP from Windows using val sparkConf = new SparkConf() .setAppName("spark-word-count") .setMaster("spark://10.33.241.160:7077") and I get the error below
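
For illustration only, here is a minimal PySpark sketch of the same remote-master setup (the question itself uses Scala); the app name and spark:// master URL are taken from the snippet above, while the HDFS input path is a hypothetical placeholder.

# Minimal PySpark sketch of the setup described above; illustrative only,
# not the asker's code. Assumes the standalone master at 10.33.241.160:7077
# is reachable from the Windows machine.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-word-count")
        .setMaster("spark://10.33.241.160:7077"))
sc = SparkContext(conf=conf)

# A tiny word count to confirm that tasks really run on the remote cluster.
# The input path is a hypothetical placeholder.
counts = (sc.textFile("hdfs:///tmp/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()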

NullPointerException with MR2 in windows

你离开我真会死。 submitted on 2019-12-25 07:17:51
Question: I have installed Hadoop 2.3.0 on Windows and am able to execute MR jobs successfully. But when trying the streaming sample in C# [with the HadoopSDK's .NET assemblies], the app ends with the following exception: 14/05/16 18:21:06 INFO mapreduce.Job: Task Id : attempt_1400239892040_0003_r_000000_0, Status : FAILED Error: java.lang.NullPointerException at org.apache.hadoop.mapred.Task.getFsStatistics(Task.java:347) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java

java.lang.NullPointerException at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close

别说谁变了你拦得住时间么 submitted on 2019-12-25 05:26:09
Question: I am running two chained MapReduce jobs; the output of the first MapReduce job is used as the input of the next one. To do that I have set job.setOutputFormatClass(SequenceFileOutputFormat.class). While running the following driver class: package org; import org.apache.commons.configuration.ConfigurationFactory; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache

Using gzip as a reducer produces corrupt data

混江龙づ霸主 submitted on 2019-12-25 01:28:49
Question: When I run Hadoop streaming like this: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -Dmapred.reduce.tasks=16 -input foo -output bar -mapper "python zot.py" -reducer gzip I get 16 files in the output directory which are, alas, corrupt: $ hadoop fs -get bar/part-00012 $ file part-00012 gzip compressed data, from Unix $ cat part-00012 | gunzip >/dev/null gzip: stdin: invalid compressed data--format violated When I inspect the output of cat part-00012 | gunzip
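
A common alternative to piping reducer output through gzip is to let Hadoop compress the job output itself. Below is a hedged sketch of an identity reducer, with the usual streaming flags shown only in a comment; the flag names assume the classic mapred.* configuration properties used by CDH-era Hadoop, and the script name is illustrative, not taken from the question.

#!/usr/bin/env python
# identity_reducer.py -- illustrative sketch, not the asker's code.
# Passes the sorted mapper output straight through and leaves compression
# to the framework rather than to a gzip subprocess.
#
# A typical invocation (flag names assume the classic mapred.* properties):
#   hadoop jar .../hadoop-streaming.jar \
#     -D mapred.output.compress=true \
#     -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
#     -input foo -output bar \
#     -mapper "python zot.py" \
#     -reducer identity_reducer.py -file identity_reducer.py
import sys

for line in sys.stdin:
    sys.stdout.write(line)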

Hadoop environment variables

做~自己de王妃 submitted on 2019-12-24 19:39:54
Question: I'm trying to debug some issues with a single-node Hadoop cluster on my Mac. All the setup docs say to add: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk" to remove this error: Unable to load realm info from SCDynamicStore This works, but it only seems to work for STDOUT. When I check my Hadoop logs directory, under "job_###/attempt_###/stderr", the error is still there: 2013-02-08 09:58:23.662 java[2772:1903] Unable to load

Facing issue in Mapper.py and Reducer.py when running code in Hadoop cluster

流过昼夜 submitted on 2019-12-24 19:33:51
Question: I am running this code to compute probabilities on a Hadoop cluster; my data is in a CSV file. When I run the code on the cluster I get the error "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1". Can anyone fix my code? #!/usr/bin/env python3 """mapper.py""" import sys # Get input lines from stdin for line in sys.stdin: # Remove spaces from beginning and end of the line line = line.strip() # Split it into tokens #tokens = line.split() #Get probability_mass values for
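
The mapper above is cut off, so as a point of comparison here is a minimal, self-contained CSV mapper of the same shape; the column index and the tab-separated output are assumptions for illustration, not the asker's actual fields. A streaming task that reports "subprocess failed with code 1" has usually died on an uncaught exception, so the sketch guards the parse.

#!/usr/bin/env python3
"""mapper.py -- minimal illustrative sketch, not the asker's actual code.

Reads CSV lines from stdin and emits tab-separated key/value pairs.
The column index used for the probability_mass value is an assumption.
"""
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fields = line.split(",")
    try:
        # Guard the parse: a malformed row should be skipped, not crash the
        # task, since any uncaught exception makes streaming report
        # "subprocess failed with code 1".
        key = fields[0]
        probability_mass = float(fields[2])
    except (IndexError, ValueError):
        continue
    print("%s\t%f" % (key, probability_mass))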

Using files in Hadoop Streaming with Python

拥有回忆 submitted on 2019-12-24 06:19:13
Question: I am completely new to Hadoop and MapReduce and am trying to work my way through it. I am trying to develop a MapReduce application in Python in which I use data from two .CSV files. I am just reading the two files in the mapper and then printing the key-value pairs from the files to sys.stdout. The program runs fine when I use it on a single machine, but with Hadoop Streaming I get an error. I think I am making some mistake in reading the files in the mapper on Hadoop. Please help me out with
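
As an illustration of one common pattern for this situation, the sketch below streams the primary CSV from stdin and treats the second CSV as a side file shipped to every task with the streaming -file (or -files) option; the file name lookup.csv and its columns are hypothetical, not the asker's data.

#!/usr/bin/env python3
"""mapper.py -- minimal illustrative sketch, not the asker's code.

Assumes the job is submitted with something like:
  -file mapper.py -file lookup.csv
so that lookup.csv sits in each task's working directory, while the
primary CSV input arrives on stdin as usual.
"""
import csv
import sys

# Load the side file once per task; "lookup.csv" and its columns are
# hypothetical names used only for illustration.
lookup = {}
with open("lookup.csv", newline="") as f:
    for row in csv.reader(f):
        if row:
            lookup[row[0]] = row[1]

# Stream the primary input from stdin and join it against the side file.
for line in sys.stdin:
    fields = line.strip().split(",")
    if not fields or not fields[0]:
        continue
    key = fields[0]
    print("%s\t%s" % (key, lookup.get(key, "NA")))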

Cutting down bag to pass to udf

China☆狼群 submitted on 2019-12-24 04:23:09
Question: Using Pig on a Hadoop cluster, I have a huge bag of huge tuples to which I regularly add fields as I continue to work on this project, and several UDFs which use various fields from it. I want to be able to call a UDF on just a few fields from each tuple and reconnect the result to that particular tuple. Doing a join to reconnect the records using unique IDs takes forever on billions of records. I think there should be a way to do this all inside the GENERATE statement, but I can't find the
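
One way this is often done is to keep every existing field and append the UDF result inside the same GENERATE, which avoids any join. The sketch below is a hypothetical Jython UDF for Pig, with the corresponding Pig Latin shown only in comments; the function name, fields, and schema are assumptions, not the asker's.

# score_udf.py -- hypothetical Jython UDF for Pig, sketch only.
#
# Registered and used roughly like this (Pig Latin shown as comments):
#   REGISTER 'score_udf.py' USING jython AS myudfs;
#   -- Keep every existing field and append the UDF result in one GENERATE,
#   -- so no join is needed to reconnect the result to its tuple:
#   with_score = FOREACH huge_relation GENERATE *, myudfs.score(f1, f2) AS score;

# Pig's Jython engine normally injects the outputSchema decorator; define a
# no-op stand-in so the file also runs as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

@outputSchema("score:double")
def score(f1, f2):
    # Illustrative computation over just the two projected fields.
    if f1 is None or f2 is None:
        return None
    return float(f1) * float(f2)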

Can we cascade multiple MapReduce jobs in Hadoop Streaming (lang: Python)

旧城冷巷雨未停 submitted on 2019-12-24 02:47:06
Question: I am using Python and have to work on the following scenario using Hadoop Streaming: a) Map1 -> Reduce1 -> Map2 -> Reduce2, b) I don't want to store intermediate files, c) I don't want to install packages like Cascading, Yelp, or Oozie; I have kept them as a last option. I already went through the same kind of discussion on SO and elsewhere but could not find an answer with respect to Python. Can you please suggest something? Answer 1: b) I don't want to store intermediate files c) I don't want to install packages like Cascading, Yelp,
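
The answer above is cut off, so for context only: the usual baseline for chaining streaming jobs from Python is a small driver that runs the jobs back to back through a temporary HDFS directory. The sketch below shows that baseline with hypothetical paths, jar location, and script names; note that it does not satisfy requirement (b), because the intermediate output is still materialised between the two jobs.

#!/usr/bin/env python
"""Illustrative driver sketch: run two Hadoop Streaming jobs back to back.

All paths, the streaming jar location, and the script names are assumptions.
The intermediate output is still written to HDFS between the jobs, so this
baseline does not meet the question's "no intermediate files" requirement.
"""
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed path

def run_streaming(mapper_script, reducer_script, input_path, output_path):
    # Submit one streaming job and wait for it to finish.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + mapper_script,
        "-reducer", "python " + reducer_script,
        "-file", mapper_script,
        "-file", reducer_script,
    ])

# Map1 -> Reduce1 writes to a staging directory; Map2 -> Reduce2 reads it.
run_streaming("map1.py", "reduce1.py", "/data/in", "/data/stage1")
run_streaming("map2.py", "reduce2.py", "/data/stage1", "/data/out")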

Hadoop on CentOS streaming example with python - permission denied on /mapred/local/taskTracker

自作多情 submitted on 2019-12-23 22:33:20
Question: I have been able to set up the streaming example with a Python mapper and reducer. The mapred folder location is /mapred/local/taskTracker; both the root and mapred users have ownership of this folder and its subfolders. However, when I run my streaming job it creates maps but no reduces and gives the following error: Cannot Run Program /mapred/local/taskTracker/root/jobcache/job_201303071607_0035/attempt_201303071607_0035_m_000001_3/work/./mapper1.py Permission Denied I noticed that though I have provided a
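
The excerpt ends before the resolution, so as a hedged illustration only: a "Cannot run program ... Permission Denied" on the shipped script is often about how the script itself is launched rather than the taskTracker directory ownership. The sketch below is a minimal hypothetical mapper1.py whose comments note the two usual checks, the interpreter (shebang) line and the executable bit.

#!/usr/bin/env python
# mapper1.py -- minimal hypothetical mapper, not the asker's script.
# Two things streaming relies on when it execs this file directly:
#   1. the shebang line above, and
#   2. the executable bit (chmod +x mapper1.py before submitting),
# otherwise the task can fail with "Cannot run program ... Permission Denied".
# Alternatively, pass -mapper "python mapper1.py" so the interpreter is explicit.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)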