google-cloud-dataproc

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

Posted by 廉价感情 on 2020-02-16 11:31:06
Question: For example, I have a folder: / - test.py - test.yml and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py" In test.py I want to access the static file I uploaded: with open('test.yml') as test_file: logging.info(test_file.read()) but I get the following exception: IOError: [Errno 2] No such file or directory: 'test.yml' How can I access the file I uploaded? Answer 1: Files distributed using SparkContext.addFile (and --files) can be
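
For reference, files shipped with --files (or added via SparkContext.addFile) are usually resolved through SparkFiles rather than a bare relative path. A minimal sketch, assuming the file keeps its original name test.yml:

    import logging
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext()

    # Files passed with --files are staged in a per-job directory;
    # SparkFiles.get() returns their absolute local path.
    with open(SparkFiles.get('test.yml')) as test_file:
        logging.info(test_file.read())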

Reading S3 data from Google's Dataproc

Posted by 只愿长相守 on 2020-02-05 04:07:05
Question: I'm running a pyspark application through Google's Dataproc on a cluster I created. In one stage, the application needs to access a directory in Amazon S3. At that stage, I get the error: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively). I logged onto the headnode of the cluster and set /etc/boto.cfg with my AWS_ACCESS
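
The error message names Hadoop configuration keys, so one common workaround is to set the credentials on the SparkContext's Hadoop configuration rather than in /etc/boto.cfg. A sketch with placeholder key values (whether the s3/s3a connector jars are present on the cluster is a separate concern):

    from pyspark import SparkContext

    sc = SparkContext()

    # Placeholder credentials -- supply your own keys.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.awsAccessKeyId", "<ACCESS_KEY_ID>")
    hadoop_conf.set("fs.s3.awsSecretAccessKey", "<SECRET_ACCESS_KEY>")
    # For s3a:// paths the equivalent keys are fs.s3a.access.key and fs.s3a.secret.key.

    rdd = sc.textFile("s3://<bucket>/<prefix>/")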

Google Cloud Dataproc Virus CrytalMiner (dr.who)

Posted by 北城以北 on 2020-01-30 08:14:23
Question: After a Dataproc cluster is created, many jobs are submitted automatically to the ResourceManager by user dr.who. This starves the cluster of resources and eventually overwhelms it. There is little to no information in the logs. Is anyone else experiencing this issue on Dataproc? Answer 1: Without knowing more, here is what I suspect is going on. It sounds like your cluster has been compromised. Your firewall (network) rules are likely open, allowing any traffic into the cluster
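
If the YARN ResourceManager (port 8088 by default) is reachable from the internet, unauthenticated requests can typically submit work to the cluster, which shows up as user dr.who. A hedged sketch of tightening the network; the rule name and source range below are placeholders, and any existing rule allowing 0.0.0.0/0 should be removed first:

    gcloud compute firewall-rules create allow-internal-only \
        --network=default \
        --source-ranges=10.128.0.0/9 \
        --allow=tcp,udp,icmp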

Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts

Posted by 不打扰是莪最后的温柔 on 2020-01-14 19:26:27
Question: I am using Spark streaming on Google Cloud Dataproc to run a framework (written in Python) that consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via two clusters, one with 3 worker nodes and one with 4. Running this Spark streaming framework on top of Dataproc has been fairly stable until the beginning
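
The question includes no code, but the pipelines as described have roughly the following shape. This is only a rough sketch: the broker, topic, and Bigtable sink are placeholders, and KafkaUtils refers to the pyspark.streaming.kafka API available in Spark 2.x:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="pipeline-sketch")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches (placeholder)

    # Read one Kafka topic as a direct stream; broker and topic are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc, ["<topic>"], {"metadata.broker.list": "<broker-host>:9092"})

    def write_to_bigtable(rdd):
        # Hypothetical sink -- in practice this would go through a
        # Bigtable/HBase connector or client library.
        pass

    stream.map(lambda kv: kv[1]).foreachRDD(write_to_bigtable)

    ssc.start()
    ssc.awaitTermination()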

Add conf file to classpath in Google Dataproc

Posted by 江枫思渺然 on 2020-01-11 11:13:28
Question: We're building a Spark application in Scala with a HOCON configuration; the config file is called application.conf. If I add application.conf to my jar file and start a job on Google Dataproc, it works correctly: gcloud dataproc jobs submit spark \ --cluster <clustername> \ --jar=gs://<bucketname>/<filename>.jar \ --region=<myregion> \ -- \ <some options> I don't want to bundle application.conf with my jar file but provide it separately, which I can't get working. I have tried different things,
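
One approach that is often suggested for this situation, sketched here with placeholder values: ship application.conf with --files and point Typesafe Config at it via -Dconfig.file instead of relying on the jar's classpath. Whether the driver sees the staged copy in its working directory depends on the deploy mode, so treat this as a sketch rather than a guaranteed recipe:

    gcloud dataproc jobs submit spark \
        --cluster <clustername> \
        --region=<myregion> \
        --jar=gs://<bucketname>/<filename>.jar \
        --files=gs://<bucketname>/application.conf \
        --properties=spark.driver.extraJavaOptions=-Dconfig.file=application.conf,spark.executor.extraJavaOptions=-Dconfig.file=application.conf \
        -- \
        <some options>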

How to keep Google Dataproc master running?

Posted by 怎甘沉沦 on 2020-01-05 14:34:19
Question: I created a cluster on Dataproc and it works great. However, after the cluster is idle for a while (~90 min), the master node automatically stops. This happens to every cluster I create. I see there is a similar question here: Keep running Dataproc Master node. It looks like it's an initialization action problem, but that post does not give me enough info to fix the issue. Below are the commands I used to create the cluster: gcloud dataproc clusters create $CLUSTER_NAME \ --project
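
Since the create command is cut off, one way to see whether an initialization action or an idle TTL (scheduled deletion) is in play is to inspect the cluster's configuration. A sketch, with cluster name and region as placeholders:

    gcloud dataproc clusters describe <cluster-name> --region=<region> \
        --format="yaml(config.initializationActions, config.lifecycleConfig)"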

Dynamic port forwarding fails after turning a Google Cloud virtual machine (Compute Engine) off and on

Posted by 会有一股神秘感。 on 2020-01-05 07:35:31
Question: I'm connecting to my Spark cluster's master node with dynamic port forwarding so that I can open the Jupyter notebook web interface on my local machine. I followed the instructions from this Google Cloud Dataproc tutorial: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook I created the SSH tunnel with the following command, as advised: gcloud compute ssh --zone=<cluster-zone> --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "<cluster-name>-m" And opened the web interface: <browser
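
The snippet cuts off at the browser step; in the referenced tutorial that step launches a browser through the SOCKS proxy along the following lines (the browser path and cluster name are placeholders; port 8123 is the port typically used by the Dataproc Jupyter setup the tutorial describes):

    <path-to-browser-executable> \
        --proxy-server="socks5://localhost:10000" \
        --user-data-dir=/tmp/<cluster-name>-m \
        http://<cluster-name>-m:8123

As for the failure itself: stopping and starting a Compute Engine instance can release its ephemeral external IP, so the tunnel may simply be pointing at an address the VM no longer has.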

PySpark reduceByKey causes out of memory

Posted by 百般思念 on 2020-01-01 19:38:11
Question: I'm trying to run a job in YARN mode that processes a large amount of data (2TB) read from Google Cloud Storage. My pipeline works just fine with 10GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy Here is the rest of the pipeline: input.groupByKey()\ [...] processing on sorted groups for each key shard .mapPartitions(sendPartition)\ .map(mergeShardsbyKey) .reduceByKey(lambda list1, list2: list1 + list2).take(10)
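
Concatenating Python lists inside reduceByKey materializes each key's full value list in memory, so memory pressure grows with the input size. A hedged re-sketch of the same chain that only adds an explicit partition count (sendPartition and mergeShardsbyKey are the question's own functions; the partition count is a placeholder to be scaled with the input):

    NUM_PARTITIONS = 2048  # placeholder; scale with the 2TB input

    result = (input
              .groupByKey(numPartitions=NUM_PARTITIONS)
              # [...] processing on sorted groups for each key shard
              .mapPartitions(sendPartition)
              .map(mergeShardsbyKey)
              # an explicit partition count spreads the shuffle over more,
              # smaller reduce tasks
              .reduceByKey(lambda list1, list2: list1 + list2, NUM_PARTITIONS)
              .take(10))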