MapReduce

Spark computation for suggesting new friendships

Submitted by 南笙酒味 on 2021-02-08 07:17:41
Question: I'm using Spark for fun and to learn new things about MapReduce, so I'm trying to write a program that suggests new friendships (i.e., a sort of recommendation system). A friendship between two individuals is suggested if they are not connected yet and have a lot of friends in common. The friendship text file has a structure similar to the following:

1	2,4,11,12,15
2	1,3,4,5,9,10
3	2,5,11,15,20,21
4	1,2,3
5	2,3,4,15,16
...

where the syntax is: ID_SRC1<TAB>ID_DST1,ID_DST2,... .
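A minimal sketch (not the asker's code) of the standard common-friends counting approach in Spark's Scala RDD API, assuming a symmetric adjacency list and a hypothetical input file friends.txt: every person contributes each pair of their own friends as a candidate, pairs that are already connected are dropped, and the remaining pairs are ranked by how many common friends produced them.

import org.apache.spark.sql.SparkSession

object FriendSuggestions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FriendSuggestions").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Parse "id<TAB>f1,f2,..." into (id, Set(friends)); the file name is a placeholder.
    val adjacency = sc.textFile("friends.txt").map { line =>
      val Array(id, rest) = line.split("\t")
      (id.toInt, rest.split(",").map(_.toInt).toSet)
    }

    // Every pair of friends of the same person shares that person as a common friend,
    // so emit the pair with count 1 and sum the counts per candidate pair.
    val commonFriendCounts = adjacency
      .flatMap { case (_, friends) =>
        val fs = friends.toSeq.sorted
        for {
          i <- fs.indices
          j <- (i + 1) until fs.size
        } yield ((fs(i), fs(j)), 1)
      }
      .reduceByKey(_ + _)

    // Pairs that are already direct friends, normalized so the smaller id comes first.
    val existingEdges = adjacency
      .flatMap { case (id, friends) => friends.map(f => if (id < f) (id, f) else (f, id)) }
      .distinct()
      .map(pair => (pair, ()))

    // Drop existing friendships and rank candidates by number of common friends.
    val suggestions = commonFriendCounts
      .subtractByKey(existingEdges)
      .sortBy { case (_, n) => -n }

    suggestions.take(10).foreach(println)
    spark.stop()
  }
}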

mongo db aggregate randomize ( shuffle ) results

Submitted by 为君一笑 on 2021-02-08 04:40:58
Question: I was going through a bunch of Mongo docs and can't find a way to shuffle or randomize result content. Is there any?

Answer 1: Specifically for the aggregation framework itself there is not really any native way, as there is no available operator as yet to do something like generate a random number. So whatever field you could possibly project to sort on would not be "truly random" for lack of a shifting seed value. The better approach is to "shuffle" the results as an array after the
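The answer is cut off above, but the client-side idea it is heading toward is straightforward: pull the aggregation results into memory and shuffle them there. A tiny sketch of that step in Scala, with the fetched documents faked as a placeholder Seq; note that newer MongoDB releases (3.2 and later) also provide a $sample aggregation stage if server-side sampling is enough.

import scala.util.Random

// Placeholder for whatever the driver returned from the aggregation;
// the documents and field names here are made up for illustration.
val results: Seq[Map[String, Any]] = Seq(
  Map("_id" -> 1, "score" -> 10),
  Map("_id" -> 2, "score" -> 7),
  Map("_id" -> 3, "score" -> 12)
)

// Uniformly reorder the already-fetched results instead of trying to
// sort "randomly" inside the pipeline.
val shuffled = Random.shuffle(results)
shuffled.foreach(println)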

How to run MapReduce tasks in Parallel with hadoop 2.x?

Submitted by 喜夏-厌秋 on 2021-02-07 19:09:58
Question: I would like my map and reduce tasks to run in parallel. However, despite trying every trick in the bag, they are still running sequentially. I read from "How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce" that, using the following formula, one can set the number of tasks running in parallel:

min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu
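A small worked example of that formula, with made-up node and container sizes, just to show how the per-node concurrency falls out of the smaller of the two ratios:

// All values below are assumptions for illustration, not settings from the question.
val nodeMemoryMb = 8192 // yarn.nodemanager.resource.memory-mb
val nodeVcores   = 8    // yarn.nodemanager.resource.cpu-vcores
val mapMemoryMb  = 1536 // mapreduce.map.memory.mb
val mapVcores    = 1    // mapreduce.map.cpu.vcores

// Integer division mirrors the fact that only whole containers can run.
val concurrentMaps = math.min(nodeMemoryMb / mapMemoryMb, nodeVcores / mapVcores)
println(s"Concurrent map containers per node: $concurrentMaps") // 5 with these numbers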

What determines the number of mappers/reducers to use given a specified set of data [closed]

Submitted by 情到浓时终转凉″ on 2021-02-07 10:35:30
Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 8 years ago. What are the factors which decide the number of mappers and reducers to use for a given set of data to achieve optimal performance? I
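The question itself is cut off, but the usual short answer has two halves: the number of map tasks is driven by the number of input splits (roughly the input size divided by the effective split size, which the block size and split-size settings influence), while the number of reduce tasks is simply whatever the job asks for. A hedged sketch of those two knobs, calling the Hadoop 2.x Java API from Scala; the job name and split size below are assumptions for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// Capping the split size cuts large inputs into more splits, hence more map tasks.
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

val job = Job.getInstance(conf, "example-job") // job name is a placeholder
// Reducers are chosen by the developer (or mapreduce.job.reduces), not derived from the data.
job.setNumReduceTasks(4)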

Escaping quotes is not working in Spark 2.2.0 while reading CSV

Submitted by 断了今生、忘了曾经 on 2021-02-07 10:34:18
Question: I am trying to read my delimited file, which is tab separated, but I am not able to read all records. Here are my input records:

head1	head2	head3
a	b	c
a2	a3	a4
a1	"b1	"c1

My code:

var inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  // .option("inferSchema", "true")
  .option("nullValue", "")
  .option("escape", "\"")
  .option("multiLine", true)
  .option("nullValue", null)
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///C:/Users/prhasija/Desktop
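The question is cut off before any answer, so as a hedged sketch only: a workaround often suggested for stray, unbalanced double quotes in a tab-separated file is to disable quote handling altogether, so a lone " is read as an ordinary character. Column names below follow the sample header and the file path is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("quote-example").master("local[*]").getOrCreate()

// Schema matching the three-column sample above.
val schema = StructType(Seq(
  StructField("head1", StringType),
  StructField("head2", StringType),
  StructField("head3", StringType)
))

val inputDf = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("quote", "")   // empty quote string disables quoting, keeping '"' as data
  .schema(schema)
  .csv("file:///C:/Users/prhasija/Desktop/input.tsv") // hypothetical path

inputDf.show(false)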

Replace groupByKey with reduceByKey in Spark

Submitted by 大憨熊 on 2021-02-07 04:28:19
Question: Hello, I often need to use groupByKey in my code, but I know it's a very heavy operation. Since I'm working to improve performance, I was wondering whether my approach of removing all groupByKey calls is efficient. I used to create an RDD from another RDD, building pairs of type (Int, Int):

rdd1 = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

and since I needed to obtain something like this:

[(1, [2, 3]), (2, [3, 4]), (3, [5])]

what I used was out = rdd1.groupByKey, but since this approach might be
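The question is truncated, but the replacement it is describing can be sketched: build the per-key collection with aggregateByKey (or reduceByKey over small lists) so values are combined map-side before the shuffle. A minimal, self-contained version using the example data from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("group-vs-reduce").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))

// Start from an empty list per key, append within a partition, concatenate across partitions.
val grouped = rdd1.aggregateByKey(List.empty[Int])(
  (acc, v) => v :: acc,
  (a, b)   => a ::: b
)

grouped.collect().foreach(println) // e.g. (1,List(3, 2)), (2,List(4, 3)), (3,List(5))

A design caveat worth keeping in mind: when the complete value list per key is genuinely needed, combining lists map-side still shuffles every element, so the real win over groupByKey comes when the combine step actually shrinks the data (sums, counts, top-k, and so on).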

Comparing Hive and MySQL: max, group by, and log analysis

Submitted by 落爺英雄遲暮 on 2021-02-06 10:52:06
Preparation

MySQL model: test_max_date(id int, name varchar(255), num int, date date)

Hive model:

create table test_date_max(id int, name string, rq Date);
insert into table test_date_max values
(1,"1","2020-12-25"),
(2,"1","2020-12-28"),
(3,"2","2020-12-25"),
(4,"2","2020-12-20");

Requirement: query each person's latest status.

Logic: each person has multiple rows, and the larger the date, the newer the status.

Process:

MySQL: SELECT id, name, date, max(date) from test_max_date group by name ORDER BY id

Hive: select name, max(rq) from test_date_max group by name;

Notes on the error message: the Hive GROUP BY issue was discussed in an earlier post. Here the Hive table has id, name, and a date; id is the primary key and does not repeat, while name can repeat. Grouping by name and applying max to rq effectively de-duplicates name and returns the largest date within each group of repeated name values. It is like a company split into several departments: the departments are fixed
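The post breaks off here, but the practical follow-up it is building toward is how to get the whole latest row per name back, id included, rather than only the max date. A hedged sketch (not from the post) using a window function through Spark SQL in Scala, against the same Hive table defined above; the SparkSession setup and Hive support are assumed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("latest-per-name")
  .enableHiveSupport()
  .getOrCreate()

// Rank rows within each name by date, newest first, and keep only the top row.
val latest = spark.sql(
  """
    |SELECT id, name, rq
    |FROM (
    |  SELECT id, name, rq,
    |         row_number() OVER (PARTITION BY name ORDER BY rq DESC) AS rn
    |  FROM test_date_max
    |) t
    |WHERE rn = 1
  """.stripMargin)

latest.show()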

MongoDB MapReduce - Emit one key/one value doesn't call reduce

Submitted by ﹥>﹥吖頭↗ on 2021-02-05 18:54:49
Question: So I'm new to MongoDB and MapReduce in general and came across this "quirk" (or at least in my mind a quirk). Say I have objects in my collection like so:

{'key':5, 'value':5}
{'key':5, 'value':4}
{'key':5, 'value':1}
{'key':4, 'value':6}
{'key':4, 'value':4}
{'key':3, 'value':0}

My map function simply emits the key and the value. My reduce function simply adds the values AND, before returning them, adds 1 (I did this to check whether the reduce function is even called). My results follow: {'
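The results are cut off above, but the behavior being probed is documented: MongoDB does not call the reduce function for a key that has only a single emitted value, so that key's value is passed through unchanged and never picks up the "+1". A tiny Scala simulation of that rule (not MongoDB code) reproduces the pattern for the sample documents:

// Emitted (key, value) pairs from the sample collection above.
val emitted = Seq(5 -> 5, 5 -> 4, 5 -> 1, 4 -> 6, 4 -> 4, 3 -> 0)

// The asker's reduce: sum the values, then add 1 as a "was reduce called?" probe.
def reduce(values: Seq[Int]): Int = values.sum + 1

val results = emitted.groupBy(_._1).map { case (key, pairs) =>
  val values = pairs.map(_._2)
  // Mimic the documented rule: reduce runs only when a key has more than one value.
  val out = if (values.size > 1) reduce(values) else values.head
  key -> out
}

results.toSeq.sortBy(-_._1).foreach(println) // (5,11), (4,11), (3,0): key 3 never gets the +1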
