Spark cartesian product

故里飘歌 2021-01-21 19:03

I have to compare coordinates in order to get the distance. Therefore I load the data with sc.textFile() and compute a cartesian product. There are about 2,000,000 lines in the text file.

1 Answer
  • 2021-01-21 19:52

    You used data.collect() in your code, which pulls all the data onto a single machine. Depending on the memory available on that machine, 2,000,000 lines of data might not fit.
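To get a sense of the scale involved (this arithmetic is illustrative, not from the original answer): a full cartesian product of n rows produces n² ordered pairs, while pairing indices with i < j produces n(n-1)/2 unordered pairs. For n = 2,000,000 both counts are enormous, but the i < j approach halves the work and skips self-pairs:

```python
# rough pair counts for n input rows (illustrative only)
n = 2_000_000
cartesian_pairs = n * n            # every ordered pair, including self-pairs
unique_pairs = n * (n - 1) // 2    # unordered pairs with i < j

print(cartesian_pairs)  # 4,000,000,000,000
print(unique_pairs)     # 1,999,999,000,000
```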

    Also, I tried to reduce the number of computations to be done by doing joins instead of using cartesian. (Please note that I just generated random numbers using numpy and that the format here may be different from what you have. Still, the main idea is the same.)

    import numpy as np
    from numpy import arcsin, cos, sqrt
    
    # suppose my data consists of latlong pairs
    # we will use the indices for pairing up values
    data = sc.parallelize(np.random.rand(10, 2)).zipWithIndex()
    data = data.map(lambda vi: (vi[1], vi[0]))  # (idx, val)
    
    # generate pairs (e.g. if I have 3 points with indices [0,1,2],
    # I only have to compute distances for pairs (0,1), (0,2) & (1,2))
    idxs = range(data.count())
    indices = sc.parallelize([(i, j) for i in idxs for j in idxs if i < j])
    
    # haversine func (I took the liberty of editing some parts of it)
    def haversine_dist(latlong1, latlong2):
        lat1, lon1 = latlong1
        lat2, lon2 = latlong2
        p = 0.017453292519943295  # pi / 180, degrees -> radians
        def hav(theta): return (1 - cos(p * theta)) / 2
        a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
        return 12742 * arcsin(sqrt(a))  # 12742 km = Earth's mean diameter
    
    # join the index pairs with the data twice to attach both coordinates:
    # (i, j) join data -> (i, (j, latlong1)) -> rekey by j -> (j, (i, latlong1))
    joined1 = indices.join(data).map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
    # (j, ((i, latlong1), latlong2)) -> ((i, j), (latlong1, latlong2))
    joined2 = joined1.join(data).map(
        lambda kv: ((kv[1][0][0], kv[0]), (kv[1][0][1], kv[1][1])))
    haversineRDD = joined2.mapValues(lambda xy: haversine_dist(xy[0], xy[1]))
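Outside of Spark, the distance formula above can be sanity-checked in plain Python (this check is mine, not part of the original answer). Two identical points should be 0 km apart, and two antipodal points on the equator should be half of Earth's circumference apart, roughly 20,015 km:

```python
from math import asin, cos, sqrt, pi

def haversine_dist(latlong1, latlong2):
    # same formula as in the RDD code, using math instead of numpy
    lat1, lon1 = latlong1
    lat2, lon2 = latlong2
    p = pi / 180  # degrees -> radians
    def hav(theta): return (1 - cos(p * theta)) / 2
    a = hav(lat2 - lat1) + cos(p * lat1) * cos(p * lat2) * hav(lon2 - lon1)
    return 12742 * asin(sqrt(a))  # 12742 km = Earth's mean diameter

print(haversine_dist((0, 0), (0, 0)))    # 0.0
print(haversine_dist((0, 0), (0, 180)))  # ~20015 km, half the circumference
```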
    