Downloading files from Google Storage using Spark (Python) and Dataproc


Question


I have an application that parallelizes the execution of Python objects that process data downloaded from Google Storage (my project bucket). The cluster is created with Google Dataproc. The problem is that the data is never downloaded! I wrote a test program to understand the problem, with the following functions to copy files from the bucket and to check whether creating files on the workers works at all:

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
  # Copy one file from the bucket folder to a local folder on the worker
  call(["gsutil", "-m", "cp", join(remoteFolder, filename), localFolder])

def execTouch(filename, localFolder):
  # Create an empty marker file to check that writing on the worker works
  call(["touch", join(localFolder, "touched_" + filename)])

I've tested these functions by calling them from a Python shell, and they work. But when I run the following code with spark-submit, the files are not downloaded (yet no error is raised):

# ...
filesRDD = sc.parallelize(fileList)
filesRDD.foreach(lambda myFile: copyDataFromBucket(myFile, remoteBucketFolder, '/tmp/output'))
filesRDD.foreach(lambda myFile: execTouch(myFile, '/tmp/output'))
# ...

The execTouch function works (I can see the files on each worker) but the copyDataFromBucket function does nothing.

So what am I doing wrong?


Answer 1:


The problem was clearly the Spark context. Replacing the call to "gsutil" with a call to "hadoop fs" solves it:

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
  # Use the Hadoop GCS connector instead of gsutil to fetch the file
  call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])

I also did a test to send data to the bucket: one only needs to replace "-copyToLocal" with "-copyFromLocal".
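For instance, a minimal sketch of the upload direction, assuming the same conventions as above (the copyDataToBucket name and its argument order are mine, not from the original post):

from subprocess import call
from os.path import join

def copyDataToBucket(filename, localFolder, remoteFolder):
  # hadoop fs -copyFromLocal <local source> <bucket destination>
  call(["hadoop", "fs", "-copyFromLocal", join(localFolder, filename), remoteFolder])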



Source: https://stackoverflow.com/questions/39945687/downloading-files-from-google-storage-using-spark-python-and-dataproc
