hadoop-partitioning

Hadoop webuser: No such user

寵の児 submitted on 2019-12-04 18:19:12
While running a Hadoop multi-node cluster, I got the error message below in my master logs. Can someone advise what to do? Do I need to create a new user, or can I use my existing machine user name here?

    2013-07-25 19:41:11,765 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user webuser
    2013-07-25 19:41:11,778 WARN org.apache.hadoop.security.ShellBasedUnixGroupsMapping: got exception trying to get groups for user webuser
    org.apache.hadoop.util.Shell$ExitCodeException: id: webuser: No such user

hdfs-site.xml file:

    <configuration> <property> <name>dfs.replication<
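
The warning itself only says that the operating-system account webuser does not exist on the master. On Hadoop 1.x-era clusters the HDFS web interface acts as the static user configured by dfs.web.ugi, whose default is webuser,webgroup, so the two usual remedies are to create that account or to point the property at an account that already exists. A minimal sketch, assuming a 1.x-style configuration (the hduser/hadoop names below are illustrative, not from the question):

    # create the default static web user on the master (illustrative)
    sudo groupadd webgroup
    sudo useradd -g webgroup webuser

or, in hdfs-site.xml, reuse an existing account instead:

    <property>
      <name>dfs.web.ugi</name>
      <value>hduser,hadoop</value>
    </property>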

Passing multiple dates as parameters to a Hive query

风格不统一 submitted on 2019-12-04 16:06:16
I am trying to pass a list of dates as a parameter to my Hive query.

    #!/bin/bash
    echo "Executing the hive query - Get distinct dates"
    var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
    echo $var
    echo "Executing the hive query - Get the parition data"
    hive -hiveconf paritionvalue=$var -e 'SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in("${hiveconf:paritionvalue}");'
    echo "Hive query - ends"

Output:

    Executing the hive query - Get distinct dates
    2009-02-01 2009-04-01
    Executing the hive query - Get the parition
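
With this pattern the distinct dates come back as plain whitespace-separated values, so the IN (...) clause receives 2009-02-01 2009-04-01 rather than a quoted, comma-separated list. A sketch of one way to build that list before handing it over with -hiveconf (same tables as above; the variable is spelled partitionvalue here):

    #!/bin/bash
    dates=$(hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;")

    # turn "2009-02-01 2009-04-01" into '2009-02-01','2009-04-01'
    in_list=$(echo $dates | sed "s/ /','/g")
    in_list="'${in_list}'"

    hive --hiveconf partitionvalue="$in_list" -e \
        "SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in (\${hiveconf:partitionvalue});"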

How to use the Hadoop MapReduce framework for an OpenCL application?

白昼怎懂夜的黑 submitted on 2019-12-02 10:28:46
I am developing an application in OpenCL whose basic objective is to implement a data mining algorithm on a GPU platform. I want to use the Hadoop Distributed File System and execute the application on multiple nodes. I am using the MapReduce framework, and I have divided my basic algorithm into two parts, 'Map' and 'Reduce'. I have never worked with Hadoop before, so I have some questions: Do I have to write my application in Java only to use Hadoop and the MapReduce framework? I have written kernel functions for map and reduce in OpenCL. Is it possible to use HDFS as a file system for a non-Java GPU
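
The jobs themselves do not have to be Java: Hadoop Streaming runs any executable that reads key/value lines from stdin and writes them to stdout as the mapper or reducer, so a native binary that launches OpenCL kernels can be plugged in, and it will still read from and write to HDFS through the framework. A rough sketch (ocl_map and ocl_reduce are hypothetical native executables, the HDFS paths are illustrative, and the streaming jar path varies with the Hadoop version):

    # drive native OpenCL mapper/reducer binaries through Hadoop Streaming
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input  /user/hduser/mining_input \
        -output /user/hduser/mining_output \
        -mapper  ocl_map \
        -reducer ocl_reduce \
        -file ocl_map \
        -file ocl_reduce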

DiskErrorException on slave machine - Hadoop multinode

戏子无情 submitted on 2019-12-01 14:44:12
I am trying to process XML files with Hadoop, and I got the following error when invoking a word-count job on the XML files:

    13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:39:58 INFO mapred.JobClient:  map 99% reduce 0%
    13/07/25 12:39:59 INFO mapred.JobClient:  map 100% reduce 0%
    13/07/25 12:40:56 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000009_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:40:58 INFO mapred.JobClient:  map 99% reduce 0%
    13/07/25 12:40:59 INFO mapred.JobClient:  map 100%
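
"Too many fetch-failures" on a multi-node cluster very often means the reducers cannot pull map output from the other TaskTrackers over HTTP, which tends to be a hostname-resolution problem rather than anything in the job itself. A sketch of the /etc/hosts layout commonly recommended for every node (addresses and hostnames are illustrative):

    # /etc/hosts on each node -- illustrative addresses and names
    192.168.0.1   master
    192.168.0.2   slave1
    192.168.0.3   slave2
    # avoid also mapping the node's own hostname to 127.0.1.1,
    # or other nodes' reducers may fail to fetch its map output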

Efficient way of joining multiple tables in Spark - No space left on device

丶灬走出姿态 submitted on 2019-11-30 23:53:51
A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows, and I need to join them with a full join based on the column ID, thereby creating a DataFrame with columns ID, Col1, Col2, Col3, Col4, Col5, ..., Col102. Just for illustration, the structure of my DataFrames:

    df1 =                         df2 =          df3 =        .....    df100 =
    +----+------+------+------+   +----+------+  +----+------+         +----+------+
    | ID | Col1 | Col2 | Col3 |   | ID | Col4 |  | ID | Col5 |         | ID |Col102|
    +----+------+------+------+   +----+------+  +----+------+         +----+
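
The usual shape of the fix is to fold an outer join over the list of frames and break the lineage every few joins: "No space left on device" typically means the shuffle files under spark.local.dir have filled the disk, and a hundred chained joins also produce an enormous plan. A minimal Scala sketch, assuming Spark 2.1+ (for Dataset.checkpoint), that all frames share the ID column, and that a checkpoint directory has been set; the names are illustrative:

    import org.apache.spark.sql.DataFrame

    // Fold a full outer join over the frames, checkpointing every few steps so the
    // lineage (and the shuffle state that backs it) does not grow without bound.
    def joinAll(dfs: Seq[DataFrame], checkpointEvery: Int = 10): DataFrame = {
      var acc = dfs.head
      for ((df, i) <- dfs.tail.zipWithIndex) {
        acc = acc.join(df, Seq("ID"), "full_outer")
        if ((i + 1) % checkpointEvery == 0) acc = acc.checkpoint()
      }
      acc
    }

    // spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // required for checkpoint()
    // val result = joinAll(all100Frames)

Pointing spark.local.dir at a larger volume (or several disks) is usually the other half of the answer.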

Hadoop handling data skew in reducer

青春壹個敷衍的年華 submitted on 2019-11-29 08:59:40
I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0, MRv1) to handle data skew for a reducer.

Scenario: I have a custom composite key and partitioner in place to route data to reducers. In order to deal with the odd but very likely case of a million keys and large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers. I am thinking of a two-step process:

1. set mapred.max.reduce.failures.percent to, say, 10% and let the job complete
2. rerun the job on the failed data set by
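
A workaround often suggested alongside the rerun idea, rather than a dedicated API hook, is key salting: the mapper appends a small suffix to keys it detects as hot, so one logical key fans out over several reducers, and a follow-up pass strips the suffix and merges the partial results. A Java sketch of what the partitioner side of that could look like (the "realKey#salt" format and the class are illustrative, not part of the Hadoop API):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hot keys arrive as "realKey#salt" (salt appended by the mapper); the salt is
    // mixed into the partition choice so a single hot key spreads across reducers.
    public class SaltedPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String k = key.toString();
            int sep = k.lastIndexOf('#');
            if (sep < 0) {
                return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;   // normal key
            }
            String base = k.substring(0, sep);
            int salt = Integer.parseInt(k.substring(sep + 1));
            return ((base.hashCode() * 31 + salt) & Integer.MAX_VALUE) % numPartitions;
        }
    }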

In Apache Spark, why does RDD.union not preserve the partitioner?

我是研究僧i submitted on 2019-11-27 13:11:05
As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized. I was experimenting with the following code:

    val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
      .partitionBy(new HashPartitioner(10))
    val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
    val cogrouped = rdd1.cogroup(rdd2)
    println("cogrouped: " + cogrouped.partitioner)
    val unioned = rdd1.union(rdd2)
    println("union: " + unioned.partitioner)

I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't: it will always
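
union() only keeps a partitioner when every input RDD already has the same one, in which case Spark builds a partitioner-aware union instead of a plain concatenation; in the snippet above rdd2 has no partitioner at all, so the result cannot promise anything about where keys live. A small sketch of the contrast, with the same data but both sides partitioned the same way:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(10)
    val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(part)
    val b = sc.parallelize(200 to 230).keyBy(_ % 13).partitionBy(part)

    // Both sides share the partitioner, so the union preserves it
    // (internally Spark uses a PartitionerAwareUnionRDD in this case).
    println(a.union(b).partitioner)   // roughly Some(org.apache.spark.HashPartitioner@...)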