I have to compare coordinates in order to get the distances between them. Therefore I load the data with sc.textFile() and take a cartesian product. There are about 2,000,000 lines in the text file.
You used data.collect()
in your code, which pulls the entire dataset onto a single machine (the driver). Depending on the memory available there, 2,000,000 lines of data might not fit. If you only need to inspect a few results, data.take(n) is a safer choice.
Also, I tried to reduce the number of computations by using joins instead of cartesian. (Please note that I just generated random numbers with numpy, so the format here may differ from what you have; still, the main idea is the same.)
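To see why pairing indices with i &lt; j cuts the work down, here is a quick local sketch in plain Python (no Spark needed; n = 10 is just an arbitrary example size):

```python
n = 10  # number of points (arbitrary example size)

# full cartesian product: every ordered pair, including self-pairs (i, i)
# and both orientations (i, j) and (j, i)
cartesian_pairs = [(i, j) for i in range(n) for j in range(n)]

# keeping only i < j: each unordered pair exactly once, no self-pairs
unique_pairs = [(i, j) for i in range(n) for j in range(n) if i < j]

print(len(cartesian_pairs))  # 100, i.e. n * n
print(len(unique_pairs))     # 45,  i.e. n * (n - 1) / 2
```

Since distance is symmetric, those 45 pairs carry all the information the 100 cartesian pairs would.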
import numpy as np
from numpy import arcsin, cos, sqrt
# suppose my data consists of latlong pairs
# we will use the indices for pairing up values
data = sc.parallelize(np.random.rand(10,2)).zipWithIndex()
data = data.map(lambda vi: (vi[1], vi[0]))  # (value, index) -> (index, value)
# generate pairs (e.g. if I have 3 points with indices [0,1,2],
# I only have to compute distances for the pairs (0,1), (0,2) & (1,2))
idxs = range(data.count())
indices = sc.parallelize([(i,j) for i in idxs for j in idxs if i < j])
# haversine formula (I took the liberty of editing some parts of it)
def haversian_dist(latlong1, latlong2):
    lat1, lon1 = latlong1
    lat2, lon2 = latlong2
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    def hav(theta):
        return (1 - cos(p * theta)) / 2
    a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
    return 12742 * arcsin(sqrt(a))  # 12742 km ~ Earth's diameter
# indices.join(data) yields (i, (j, latlong_i)); re-key by j
joined1 = indices.join(data).map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
# joining again yields (j, ((i, latlong1), latlong2)); key by the pair (i, j)
joined2 = joined1.join(data).map(lambda kv: ((kv[1][0][0], kv[0]), (kv[1][0][1], kv[1][1])))
haversianRDD = joined2.mapValues(lambda xy: haversian_dist(xy[0], xy[1]))
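As a quick sanity check of the distance formula itself (pure numpy, no Spark needed): one degree of longitude along the equator should come out to roughly 111 km, and identical points should give zero. The coordinates below are just illustrative values:

```python
import numpy as np
from numpy import arcsin, cos, sqrt

def haversian_dist(latlong1, latlong2):
    # same formula as above: p converts degrees to radians (pi / 180),
    # 12742 km is roughly the Earth's diameter
    lat1, lon1 = latlong1
    lat2, lon2 = latlong2
    p = 0.017453292519943295
    def hav(theta):
        return (1 - cos(p * theta)) / 2
    a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
    return 12742 * arcsin(sqrt(a))

# one degree of longitude along the equator: ~111.19 km
print(round(haversian_dist((0.0, 0.0), (0.0, 1.0)), 2))
# identical points: distance 0.0
print(haversian_dist((40.0, -75.0), (40.0, -75.0)))
```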