Question
I define the following code in order to load a pretrained embedding model:
import gensim
from gensim.models.fasttext import FastText as FT_gensim
import numpy as np

class Loader(object):
    cache = {}
    emb_dic = {}
    count = 0

    def __init__(self, filename):
        print("|-------------------------------------|")
        print("Welcome to Loader class in python")
        print("|-------------------------------------|")
        self.fn = filename

    @property
    def fasttext(self):
        if Loader.count == 1:
            print("already loaded")
        if self.fn not in Loader.cache:
            Loader.cache[self.fn] = FT_gensim.load_fasttext_format(self.fn)
            Loader.count = Loader.count + 1
        return Loader.cache[self.fn]

    def map(self, word):
        if word not in self.fasttext:
            Loader.emb_dic[word] = np.random.uniform(low=0.0, high=1.0, size=300)
            return Loader.emb_dic[word]
        return self.fasttext[word]
I call this class like this:
inputRaw = sc.textFile(inputFile, 3).map(lambda line: (line.split("\t")[0], line.split("\t")[1])).map(Loader(modelpath).map)
- I am confused about how many times the modelpath file will be loaded. I want it to be loaded once per executor and shared by all of that executor's cores. My guess is that modelpath will be loaded 3 times (once per partition). If that guess is right, the drawback of this design grows with the size of the modelpath file. Suppose the file is 10 GB and I have 200 partitions: we would then need 10 * 200 = 2000 GB, which is huge (so this approach can only work with a small number of partitions).
Suppose I have an RDD of (id, sentence) pairs:
rdd = [(id1, u'patina californian'), (id2, u'virgil american'), (id3, u'frensh'), (id4, u'american')]
and I want to sum up the word embedding vectors for each sentence:
def test(document):
    print("document is = {}".format(document))
    documentWords = document.split(" ")
    features = np.zeros(300)
    for word in documentWords:
        features = np.add(features, Loader(modelpath).fasttext[word])
    return features

def calltest(inputRawSource):
    my_rdd = inputRawSource.map(lambda line: (line[0], test(line[1]))).cache()
    return my_rdd
In this case, how many times will the modelpath file be loaded? Note that I set "spark.executor.instances" to 3.
Answer 1:
By default, the number of partitions is set to the total number of cores on all the executor nodes in the Spark cluster. Suppose you are processing 10 GB on a Spark cluster (or supercomputing executor) that contains a total of 200 CPU cores: that means Spark might use 200 partitions, by default, to process your data.
Also, to put all the CPU cores of each executor to work, you can handle this in Python with the multiprocessing module (using 100% of all cores).
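To connect the partition count above to how often the model file actually gets loaded, here is a minimal sketch of a lazy, per-process cache used from a partition-level function. It follows the same idea as the `Loader.cache` dictionary in the question. The names `get_model`, `make_partition_embedder`, and `_MODEL_CACHE` are hypothetical, and the sketch assumes the helpers live in a module that is importable on the executors (for example, shipped with `--py-files`), so that the cache is a real module-level global in each Python worker. Under that assumption the file is loaded once per Python worker process, not once per partition; note that PySpark typically runs one Python worker per executor core, so this is per worker process rather than strictly per executor.

```python
# Module-level cache: one copy per Python process (hypothetical helper names).
_MODEL_CACHE = {}


def get_model(path, loader):
    """Return the model for `path`, calling `loader(path)` only on first use in this process."""
    if path not in _MODEL_CACHE:
        _MODEL_CACHE[path] = loader(path)
    return _MODEL_CACHE[path]


def make_partition_embedder(path, loader, dim=300):
    """Build a function for rdd.mapPartitions: iterator of (id, sentence) -> iterator of (id, vector)."""
    def embed_partition(rows):
        # Looked up once per partition; the expensive load happens at most once per process.
        model = get_model(path, loader)
        for doc_id, sentence in rows:
            features = [0.0] * dim
            for word in sentence.split(" "):
                # Plain-Python elementwise sum keeps the sketch dependency-free;
                # np.add as in the question would work the same way.
                features = [f + x for f, x in zip(features, model[word])]
            yield doc_id, features
    return embed_partition
```

With the names from the question, this could be used as `inputRawSource.mapPartitions(make_partition_embedder(modelpath, FT_gensim.load_fasttext_format))`. Words missing from the model are not handled here; the random-vector fallback from `Loader.map` would need to be added for that.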
Source: https://stackoverflow.com/questions/54540970/how-to-load-a-file-in-each-executor-once