How to load a file in each executor once?

Question

I defined the following class to load a pretrained embedding model:

import gensim

from gensim.models.fasttext import FastText as FT_gensim
import numpy as np

class Loader(object):
    cache = {}
    emb_dic = {}
    count = 0
    def __init__(self, filename):
        print("|-------------------------------------|")
        print("Welcome to Loader class in python")
        print("|-------------------------------------|")
        self.fn = filename

    @property
    def fasttext(self):
        if Loader.count == 1:
            print("already loaded")
        if self.fn not in Loader.cache:
            Loader.cache[self.fn] =  FT_gensim.load_fasttext_format(self.fn)
            Loader.count = Loader.count + 1
        return Loader.cache[self.fn]


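    def map(self, word):
        if word not in self.fasttext:
            # Reuse the random vector already assigned to this out-of-vocabulary word, if any.
            if word not in Loader.emb_dic:
                Loader.emb_dic[word] = np.random.uniform(low=0.0, high=1.0, size=300)
            return Loader.emb_dic[word]
        return self.fasttext[word]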

I call this class like this:

inputRaw = sc.textFile(inputFile, 3).map(lambda line: (line.split("\t")[0], line.split("\t")[1])).map(Loader(modelpath).map)

I'm confused about how many times the modelpath file will be loaded. I want it to be loaded once per executor and shared by all of that executor's cores. My guess is that modelpath will be loaded 3 times (the number of partitions). If that is right, the drawback of this design is tied to the size of the modelpath file. Suppose the file is 10 GB and I have 200 partitions. In that case we would need 10 * 200 GB = 2000 GB, which is huge (so this solution only works with a small number of partitions).

Suppose I have an RDD of (id, sentence) pairs: [(id1, u'patina californian'), (id2, u'virgil american'), (id3, u'frensh'), (id4, u'american')]

and I want to sum up the word embedding vectors for each sentence:

def test(document):
    print("document is = {}".format(document))
    documentWords = document.split(" ")
    features = np.zeros(300)
    for word in documentWords:
        features = np.add(features, Loader(modelpath).fasttext[word])
    return features

def calltest(inputRawSource):

    my_rdd = inputRawSource.map(lambda line: (line[0], test(line[1]))).cache()
    return my_rdd

In this case, how many times will the modelpath file be loaded? Note that I set "spark.executor.instances" to 3.

Answer 1:

By default, the number of partitions is set to the total number of cores on all the executor nodes in the Spark cluster. Suppose you are processing 10 GB on a Spark cluster (or supercomputing executor) that contains a total of 200 CPU cores; in that case Spark might use 200 partitions, by default, to process your data.
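To relate the partition count to the memory concern in the question, a rough back-of-the-envelope estimate may help. The sketch below assumes that each executor runs about one Python worker process per core and that those workers are reused across tasks, so a class-level cache is filled once per worker process rather than once per partition. The function name and the value of cores_per_executor are hypothetical; the question does not state how many cores each executor has.

```python
def estimate_model_copies(num_partitions, num_executors, cores_per_executor):
    """Rough upper bound on model copies held in memory at the same time.

    Assumption: one Python worker process per executor core, reused across
    tasks, so the cache is filled at most once per worker process.
    """
    worker_processes = num_executors * cores_per_executor
    return min(num_partitions, worker_processes)


model_size_gb = 10
num_partitions = 200
num_executors = 3        # spark.executor.instances from the question
cores_per_executor = 4   # hypothetical value, not given in the question

copies = estimate_model_copies(num_partitions, num_executors, cores_per_executor)
print(copies, "copies,", copies * model_size_gb, "GB in total")
print("per-partition estimate:", num_partitions * model_size_gb, "GB")
```

Under these assumptions the footprint scales with the number of worker processes, not with the number of partitions, which is why the two printed totals differ.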

Also, to put all of an executor's CPU cores to work, this can be solved in Python (using 100% of all cores with the multiprocessing module).

Source: https://stackoverflow.com/questions/54540970/how-to-load-a-file-in-each-executor-once

Tags

apache-spark

pyspark

fasttext