google-cloud-dataproc

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

Posted by 廉价感情 on 2020-02-16 11:31:06
Question: For example, I have a folder: / - test.py - test.yml and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py" In test.py I want to access the static file I uploaded: with open('test.yml') as test_file: logging.info(test_file.read()) but I get the following exception: IOError: [Errno 2] No such file or directory: 'test.yml' How can I access the file I uploaded? Answer 1: Files distributed using SparkContext.addFile (and --files) can be
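
For reference, files shipped with --files (or added via SparkContext.addFile) are usually resolved through SparkFiles rather than a bare relative path. A minimal sketch, assuming the file keeps its original name test.yml:

    import logging
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext()

    # Files passed with --files are staged in a per-job directory;
    # SparkFiles.get() returns their absolute local path.
    with open(SparkFiles.get('test.yml')) as test_file:
        logging.info(test_file.read())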

Reading S3 data from Google's Dataproc

Posted by 只愿长相守 on 2020-02-05 04:07:05
Question: I'm running a pyspark application through Google's Dataproc on a cluster I created. In one stage, the application needs to access a directory in Amazon S3. At that stage, I get the error: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively). I logged onto the headnode of the cluster and set /etc/boto.cfg with my AWS_ACCESS
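
The error message names Hadoop configuration keys, so one common workaround is to set the credentials on the SparkContext's Hadoop configuration rather than in /etc/boto.cfg. A sketch with placeholder key values (whether the s3/s3a connector jars are present on the cluster is a separate concern):

    from pyspark import SparkContext

    sc = SparkContext()

    # Placeholder credentials -- supply your own keys.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.awsAccessKeyId", "<ACCESS_KEY_ID>")
    hadoop_conf.set("fs.s3.awsSecretAccessKey", "<SECRET_ACCESS_KEY>")
    # For s3a:// paths the equivalent keys are fs.s3a.access.key and fs.s3a.secret.key.

    rdd = sc.textFile("s3://<bucket>/<prefix>/")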

Google Cloud Dataproc Virus CrytalMiner (dr.who)

Posted by 北城以北 on 2020-01-30 08:14:23
Question: After a Dataproc cluster is created, many jobs are submitted automatically to the ResourceManager by user dr.who. This starves the cluster of resources and eventually overwhelms it. There is little to no information in the logs. Is anyone else experiencing this issue on Dataproc? Answer 1: Without knowing more, here is what I suspect is going on. It sounds like your cluster has been compromised. Your firewall (network) rules are likely open, allowing any traffic into the cluster
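
If the YARN ResourceManager (port 8088 by default) is reachable from the internet, unauthenticated requests can typically submit work to the cluster, which shows up as user dr.who. A hedged sketch of tightening the network; the rule name and source range below are placeholders, and any existing rule allowing 0.0.0.0/0 should be removed first:

    gcloud compute firewall-rules create allow-internal-only \
        --network=default \
        --source-ranges=10.128.0.0/9 \
        --allow=tcp,udp,icmp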

Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts

Posted by 不打扰是莪最后的温柔 on 2020-01-14 19:26:27
Question: I am using Spark streaming on Google Cloud Dataproc to run a framework (written in Python) that consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via two clusters, one with 3 worker nodes and one with 4. Running this Spark streaming framework on top of Dataproc has been fairly stable until the beginning
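
The question includes no code, but the pipelines as described have roughly the following shape. This is only a rough sketch: the broker, topic, and Bigtable sink are placeholders, and KafkaUtils refers to the pyspark.streaming.kafka API available in Spark 2.x:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="pipeline-sketch")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches (placeholder)

    # Read one Kafka topic as a direct stream; broker and topic are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc, ["<topic>"], {"metadata.broker.list": "<broker-host>:9092"})

    def write_to_bigtable(rdd):
        # Hypothetical sink -- in practice this would go through a
        # Bigtable/HBase connector or client library.
        pass

    stream.map(lambda kv: kv[1]).foreachRDD(write_to_bigtable)

    ssc.start()
    ssc.awaitTermination()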

Add conf file to classpath in Google Dataproc

Posted by 江枫思渺然 on 2020-01-11 11:13:28
Question: We're building a Spark application in Scala with a HOCON configuration; the config file is called application.conf. If I add application.conf to my jar file and start a job on Google Dataproc, it works correctly: gcloud dataproc jobs submit spark \ --cluster <clustername> \ --jar=gs://<bucketname>/<filename>.jar \ --region=<myregion> \ -- \ <some options> I don't want to bundle application.conf with my jar file but provide it separately, which I can't get working. I have tried different things,
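
One approach that is often suggested for this situation, sketched here with placeholder values: ship application.conf with --files and point Typesafe Config at it via -Dconfig.file instead of relying on the jar's classpath. Whether the driver sees the staged copy in its working directory depends on the deploy mode, so treat this as a sketch rather than a guaranteed recipe:

    gcloud dataproc jobs submit spark \
        --cluster <clustername> \
        --region=<myregion> \
        --jar=gs://<bucketname>/<filename>.jar \
        --files=gs://<bucketname>/application.conf \
        --properties=spark.driver.extraJavaOptions=-Dconfig.file=application.conf,spark.executor.extraJavaOptions=-Dconfig.file=application.conf \
        -- \
        <some options>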

How to keep Google Dataproc master running?

Posted by 怎甘沉沦 on 2020-01-05 14:34:19
Question: I created a cluster on Dataproc and it works great. However, after the cluster is idle for a while (~90 min), the master node automatically stops. This happens to every cluster I create. I see there is a similar question here: Keep running Dataproc Master node. It looks like it's an initialization action problem, but that post does not give me enough info to fix the issue. Below are the commands I used to create the cluster: gcloud dataproc clusters create $CLUSTER_NAME \ --project
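
Since the create command is cut off, one way to see whether an initialization action or an idle TTL (scheduled deletion) is in play is to inspect the cluster's configuration. A sketch, with cluster name and region as placeholders:

    gcloud dataproc clusters describe <cluster-name> --region=<region> \
        --format="yaml(config.initializationActions, config.lifecycleConfig)"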

Dynamic port forwarding fails after turning a Google Cloud virtual machine (Compute Engine) off and on

Posted by 会有一股神秘感。 on 2020-01-05 07:35:31
Question: I'm connecting to my Spark cluster's master node with dynamic port forwarding so that I can open the Jupyter notebook web interface on my local machine. I followed the instructions from this Google Cloud Dataproc tutorial: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook I created the SSH tunnel with the following command, as advised: gcloud compute ssh --zone=<cluster-zone> --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "<cluster-name>-m" And opened the web interface: <browser
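
The snippet cuts off at the browser step; in the referenced tutorial that step launches a browser through the SOCKS proxy along the following lines (the browser path and cluster name are placeholders; port 8123 is the port typically used by the Dataproc Jupyter setup the tutorial describes):

    <path-to-browser-executable> \
        --proxy-server="socks5://localhost:10000" \
        --user-data-dir=/tmp/<cluster-name>-m \
        http://<cluster-name>-m:8123

As for the failure itself: stopping and starting a Compute Engine instance can release its ephemeral external IP, so the tunnel may simply be pointing at an address the VM no longer has.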

PySpark reduceByKey causes out of memory

Posted by 百般思念 on 2020-01-01 19:38:11
Question: I'm trying to run a job in YARN mode that processes a large amount of data (2TB) read from Google Cloud Storage. My pipeline works just fine with 10GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy Here is the rest of the pipeline: input.groupByKey()\ [...] processing on sorted groups for each key shard .mapPartitions(sendPartition)\ .map(mergeShardsbyKey) .reduceByKey(lambda list1, list2: list1 + list2).take(10)
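
Concatenating Python lists inside reduceByKey materializes each key's full value list in memory, so memory pressure grows with the input size. A hedged re-sketch of the same chain that only adds an explicit partition count (sendPartition and mergeShardsbyKey are the question's own functions; the partition count is a placeholder to be scaled with the input):

    NUM_PARTITIONS = 2048  # placeholder; scale with the 2TB input

    result = (input
              .groupByKey(numPartitions=NUM_PARTITIONS)
              # [...] processing on sorted groups for each key shard
              .mapPartitions(sendPartition)
              .map(mergeShardsbyKey)
              # an explicit partition count spreads the shuffle over more,
              # smaller reduce tasks
              .reduceByKey(lambda list1, list2: list1 + list2, NUM_PARTITIONS)
              .take(10))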