hadoop-partitioning

Hadoop webuser: No such user

寵の児 submitted on 2019-12-04 18:19:12
While running a Hadoop multi-node cluster, I got the error message below in my master logs. Can someone advise what to do? Do I need to create a new user, or can I use my existing machine user name here?

    2013-07-25 19:41:11,765 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user webuser
    2013-07-25 19:41:11,778 WARN org.apache.hadoop.security.ShellBasedUnixGroupsMapping: got exception trying to get groups for user webuser
    org.apache.hadoop.util.Shell$ExitCodeException: id: webuser: No such user

hdfs-site.xml file:

    <configuration> <property> <name>dfs.replication<
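
The warning itself only says that the operating-system account webuser does not exist on the master. On Hadoop 1.x-era clusters the HDFS web interface acts as the static user configured by dfs.web.ugi, whose default is webuser,webgroup, so the two usual remedies are to create that account or to point the property at an account that already exists. A minimal sketch, assuming a 1.x-style configuration (the hduser/hadoop names below are illustrative, not from the question):

    # create the default static web user on the master (illustrative)
    sudo groupadd webgroup
    sudo useradd -g webgroup webuser

or, in hdfs-site.xml, reuse an existing account instead:

    <property>
      <name>dfs.web.ugi</name>
      <value>hduser,hadoop</value>
    </property>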

Passing multiple dates as parameters to a Hive query

风格不统一 submitted on 2019-12-04 16:06:16
I am trying to pass a list of dates as a parameter to my Hive query.

    #!/bin/bash
    echo "Executing the hive query - Get distinct dates"
    var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
    echo $var
    echo "Executing the hive query - Get the parition data"
    hive -hiveconf paritionvalue=$var -e 'SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in("${hiveconf:paritionvalue}");'
    echo "Hive query - ends"

Output:

    Executing the hive query - Get distinct dates
    2009-02-01 2009-04-01
    Executing the hive query - Get the parition
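
With this pattern the distinct dates come back as plain whitespace-separated values, so the IN (...) clause receives 2009-02-01 2009-04-01 rather than a quoted, comma-separated list. A sketch of one way to build that list before handing it over with -hiveconf (same tables as above; the variable is spelled partitionvalue here):

    #!/bin/bash
    dates=$(hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;")

    # turn "2009-02-01 2009-04-01" into '2009-02-01','2009-04-01'
    in_list=$(echo $dates | sed "s/ /','/g")
    in_list="'${in_list}'"

    hive --hiveconf partitionvalue="$in_list" -e \
        "SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in (\${hiveconf:partitionvalue});"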

How to use the Hadoop MapReduce framework for an OpenCL application?

白昼怎懂夜的黑 submitted on 2019-12-02 10:28:46
I am developing an application in OpenCL whose basic objective is to implement a data mining algorithm on a GPU platform. I want to use the Hadoop Distributed File System and execute the application on multiple nodes. I am using the MapReduce framework, and I have divided my basic algorithm into two parts, 'Map' and 'Reduce'. I have never worked with Hadoop before, so I have some questions: Do I have to write my application in Java only to use Hadoop and the MapReduce framework? I have written kernel functions for map and reduce in OpenCL. Is it possible to use HDFS as a file system for a non-Java GPU
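
The jobs themselves do not have to be Java: Hadoop Streaming runs any executable that reads key/value lines from stdin and writes them to stdout as the mapper or reducer, so a native binary that launches OpenCL kernels can be plugged in, and it will still read from and write to HDFS through the framework. A rough sketch (ocl_map and ocl_reduce are hypothetical native executables, the HDFS paths are illustrative, and the streaming jar path varies with the Hadoop version):

    # drive native OpenCL mapper/reducer binaries through Hadoop Streaming
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input  /user/hduser/mining_input \
        -output /user/hduser/mining_output \
        -mapper  ocl_map \
        -reducer ocl_reduce \
        -file ocl_map \
        -file ocl_reduce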

DiskErrorException on slave machine - Hadoop multinode

戏子无情 submitted on 2019-12-01 14:44:12
I am trying to process XML files with Hadoop, and I got the following error when invoking a word-count job on the XML files:

    13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:39:58 INFO mapred.JobClient:  map 99% reduce 0%
    13/07/25 12:39:59 INFO mapred.JobClient:  map 100% reduce 0%
    13/07/25 12:40:56 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000009_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:40:58 INFO mapred.JobClient:  map 99% reduce 0%
    13/07/25 12:40:59 INFO mapred.JobClient:  map 100%
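
"Too many fetch-failures" on a multi-node cluster very often means the reducers cannot pull map output from the other TaskTrackers over HTTP, which tends to be a hostname-resolution problem rather than anything in the job itself. A sketch of the /etc/hosts layout commonly recommended for every node (addresses and hostnames are illustrative):

    # /etc/hosts on each node -- illustrative addresses and names
    192.168.0.1   master
    192.168.0.2   slave1
    192.168.0.3   slave2
    # avoid also mapping the node's own hostname to 127.0.1.1,
    # or other nodes' reducers may fail to fetch its map output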

Efficient way of joining multiple tables in Spark - No space left on device

丶灬走出姿态 submitted on 2019-11-30 23:53:51
A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows, and I need to join them with a full join based on the column ID, thereby creating a DataFrame with columns ID, Col1, Col2, Col3, Col4, Col5, ..., Col102. Just for illustration, the structure of my DataFrames:

    df1 =                         df2 =          df3 =        .....    df100 =
    +----+------+------+------+   +----+------+  +----+------+         +----+------+
    | ID | Col1 | Col2 | Col3 |   | ID | Col4 |  | ID | Col5 |         | ID |Col102|
    +----+------+------+------+   +----+------+  +----+------+         +----+
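
The usual shape of the fix is to fold an outer join over the list of frames and break the lineage every few joins: "No space left on device" typically means the shuffle files under spark.local.dir have filled the disk, and a hundred chained joins also produce an enormous plan. A minimal Scala sketch, assuming Spark 2.1+ (for Dataset.checkpoint), that all frames share the ID column, and that a checkpoint directory has been set; the names are illustrative:

    import org.apache.spark.sql.DataFrame

    // Fold a full outer join over the frames, checkpointing every few steps so the
    // lineage (and the shuffle state that backs it) does not grow without bound.
    def joinAll(dfs: Seq[DataFrame], checkpointEvery: Int = 10): DataFrame = {
      var acc = dfs.head
      for ((df, i) <- dfs.tail.zipWithIndex) {
        acc = acc.join(df, Seq("ID"), "full_outer")
        if ((i + 1) % checkpointEvery == 0) acc = acc.checkpoint()
      }
      acc
    }

    // spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // required for checkpoint()
    // val result = joinAll(all100Frames)

Pointing spark.local.dir at a larger volume (or several disks) is usually the other half of the answer.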

Hadoop handling data skew in reducer

青春壹個敷衍的年華 submitted on 2019-11-29 08:59:40
I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0, MRv1) to handle data skew for a reducer.

Scenario: I have a custom composite key and partitioner in place to route data to reducers. In order to deal with the odd but very likely case of a million keys and large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers. I am thinking of a two-step process:

1. set mapred.max.reduce.failures.percent to, say, 10% and let the job complete
2. rerun the job on the failed data set by
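
A workaround often suggested alongside the rerun idea, rather than a dedicated API hook, is key salting: the mapper appends a small suffix to keys it detects as hot, so one logical key fans out over several reducers, and a follow-up pass strips the suffix and merges the partial results. A Java sketch of what the partitioner side of that could look like (the "realKey#salt" format and the class are illustrative, not part of the Hadoop API):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hot keys arrive as "realKey#salt" (salt appended by the mapper); the salt is
    // mixed into the partition choice so a single hot key spreads across reducers.
    public class SaltedPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String k = key.toString();
            int sep = k.lastIndexOf('#');
            if (sep < 0) {
                return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;   // normal key
            }
            String base = k.substring(0, sep);
            int salt = Integer.parseInt(k.substring(sep + 1));
            return ((base.hashCode() * 31 + salt) & Integer.MAX_VALUE) % numPartitions;
        }
    }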

In Apache Spark, why does RDD.union not preserve the partitioner?

我是研究僧i submitted on 2019-11-27 13:11:05
As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized. I was experimenting with the following code:

    val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
      .partitionBy(new HashPartitioner(10))
    val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
    val cogrouped = rdd1.cogroup(rdd2)
    println("cogrouped: " + cogrouped.partitioner)
    val unioned = rdd1.union(rdd2)
    println("union: " + unioned.partitioner)

I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't: it will always
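
union() only keeps a partitioner when every input RDD already has the same one, in which case Spark builds a partitioner-aware union instead of a plain concatenation; in the snippet above rdd2 has no partitioner at all, so the result cannot promise anything about where keys live. A small sketch of the contrast, with the same data but both sides partitioned the same way:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(10)
    val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(part)
    val b = sc.parallelize(200 to 230).keyBy(_ % 13).partitionBy(part)

    // Both sides share the partitioner, so the union preserves it
    // (internally Spark uses a PartitionerAwareUnionRDD in this case).
    println(a.union(b).partitioner)   // roughly Some(org.apache.spark.HashPartitioner@...)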