I have to compare coordinates in order to get the distances between them. Therefore I load the data with sc.textFile() and take a cartesian product. There are about 2,000,000 lines in the text file.
You used data.collect()
in your code, which pulls the entire dataset onto a single machine (the driver). Depending on the memory available there, 2,000,000 lines of data might not fit. If you only need to inspect a few results, data.take(n) is a safer choice.
Also, I tried to reduce the number of computations by using joins instead of cartesian. (Please note that I just generated random numbers with numpy, so the format here may differ from what you have; still, the main idea is the same.)
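To see why pairing indices with i &lt; j cuts the work down, here is a quick local sketch in plain Python (no Spark needed; n = 10 is just an arbitrary example size):

```python
n = 10  # number of points (arbitrary example size)

# full cartesian product: every ordered pair, including self-pairs (i, i)
# and both orientations (i, j) and (j, i)
cartesian_pairs = [(i, j) for i in range(n) for j in range(n)]

# keeping only i < j: each unordered pair exactly once, no self-pairs
unique_pairs = [(i, j) for i in range(n) for j in range(n) if i < j]

print(len(cartesian_pairs))  # 100, i.e. n * n
print(len(unique_pairs))     # 45,  i.e. n * (n - 1) / 2
```

Since distance is symmetric, those 45 pairs carry all the information the 100 cartesian pairs would.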
import numpy as np
from numpy import arcsin, cos, sqrt
# suppose my data consists of latlong pairs
# we will use the indices for pairing up values
data = sc.parallelize(np.random.rand(10,2)).zipWithIndex()
data = data.map(lambda vi: (vi[1], vi[0]))  # (value, index) -> (index, value)
# generate pairs (e.g. if I have 3 points with indices [0,1,2],
# I only have to compute distances for the pairs (0,1), (0,2) & (1,2))
idxs = range(data.count())
indices = sc.parallelize([(i,j) for i in idxs for j in idxs if i < j])
# haversine formula (I took the liberty of editing some parts of it)
def haversian_dist(latlong1, latlong2):
    lat1, lon1 = latlong1
    lat2, lon2 = latlong2
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    def hav(theta):
        return (1 - cos(p * theta)) / 2
    a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
    return 12742 * arcsin(sqrt(a))  # 12742 km ~ Earth's diameter
# indices.join(data) yields (i, (j, latlong_i)); re-key by j
joined1 = indices.join(data).map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
# joining again yields (j, ((i, latlong1), latlong2)); key by the pair (i, j)
joined2 = joined1.join(data).map(lambda kv: ((kv[1][0][0], kv[0]), (kv[1][0][1], kv[1][1])))
haversianRDD = joined2.mapValues(lambda xy: haversian_dist(xy[0], xy[1]))
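As a quick sanity check of the distance formula itself (pure numpy, no Spark needed): one degree of longitude along the equator should come out to roughly 111 km, and identical points should give zero. The coordinates below are just illustrative values:

```python
import numpy as np
from numpy import arcsin, cos, sqrt

def haversian_dist(latlong1, latlong2):
    # same formula as above: p converts degrees to radians (pi / 180),
    # 12742 km is roughly the Earth's diameter
    lat1, lon1 = latlong1
    lat2, lon2 = latlong2
    p = 0.017453292519943295
    def hav(theta):
        return (1 - cos(p * theta)) / 2
    a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
    return 12742 * arcsin(sqrt(a))

# one degree of longitude along the equator: ~111.19 km
print(round(haversian_dist((0.0, 0.0), (0.0, 1.0)), 2))
# identical points: distance 0.0
print(haversian_dist((40.0, -75.0), (40.0, -75.0)))
```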