google-cloud-dataproc

Using non-default service account in Google Cloud dataproc

Submitted by 久未见 on 2019-12-02 08:00:38
Question: I'd like to create a Dataproc cluster that runs under a non-default service account. The following works for a Compute Engine instance:

    gcloud compute instances create instance-1 --machine-type "n1-standard-1" --zone "europe-west1-b" --scopes xxxxxxxx@yyyyyyyy.iam.gserviceaccount.com="https://www.googleapis.com/auth/cloud-platform"

But the same --scopes argument fails when creating a Dataproc cluster:

    gcloud dataproc clusters create --zone "europe-west1-b" --scopes xxxxxxxx@yyyyyyyy.iam
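A note on the fix usually given for this: newer gcloud releases expose a separate --service-account flag for Dataproc, with --scopes listing only the scope URLs. A minimal Python sketch that shells out to the gcloud CLI (the cluster name is a placeholder, and the CLI is assumed to be installed and authenticated):

    # Sketch: create a Dataproc cluster under a non-default service account
    # by invoking the gcloud CLI; names are placeholders.
    import subprocess

    cmd = [
        "gcloud", "dataproc", "clusters", "create", "example-cluster",
        "--zone", "europe-west1-b",
        # The account goes in --service-account, not inside --scopes.
        "--service-account", "xxxxxxxx@yyyyyyyy.iam.gserviceaccount.com",
        # Scopes are listed on their own, without the account prefix.
        "--scopes", "https://www.googleapis.com/auth/cloud-platform",
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if gcloud fails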

How to resolve Guava dependency issue while submitting Uber Jar to Google Dataproc

Submitted by 陌路散爱 on 2019-12-02 06:36:22
I am using the maven-shade-plugin to build an uber jar for submitting it as a job to a Google Dataproc cluster. Google has installed Apache Spark 2.0.2 and Apache Hadoop 2.7.3 on the cluster. Apache Spark 2.0.2 uses com.google.guava 14.0.1 and Apache Hadoop 2.7.3 uses 11.0.2; both should already be on the classpath.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.0.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <!-- <artifactSet> <includes> <include>com.google.guava:guava

PySpark print to console

Submitted by 核能气质少年 on 2019-12-02 04:07:42
Question: When running a PySpark job on the Dataproc server like this

    gcloud --project <project_name> dataproc jobs submit pyspark --cluster <cluster_name> <python_script>

my print statements don't show up in my terminal. Is there any way to output data to the terminal in PySpark when running jobs on the cloud? Edit: I would like to print/log info from within my transformation. For example:

    def print_funct(l):
        print(l)
        return l

    rddData.map(lambda l: print_funct(l)).collect()

should print every line
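The usual explanation offered for this behaviour: print calls inside a transformation run on the executors, so their output ends up in the executor (YARN container) logs rather than in the terminal that submitted the job. A minimal sketch of the driver-side workaround, assuming rddData is the (reasonably small) RDD from the question:

    # Sketch: bring the data back to the driver before printing, so the output
    # shows up in the driver log that `gcloud dataproc jobs submit` streams back.
    def print_funct(l):
        print(l)
        return l

    # print() inside map() executes on the executors; its output goes to the
    # executor logs, not to the submitting terminal.
    transformed = rddData.map(print_funct)

    # Collecting first and printing on the driver makes the lines visible
    # in the terminal/driver output of the submitted job.
    for line in transformed.collect():
        print(line)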

pyspark rdd isCheckPointed() is false

Submitted by ﹥>﹥吖頭↗ on 2019-12-02 03:14:41
I was encountering StackOverflowErrors when iteratively adding over 500 columns to my PySpark dataframe, so I included checkpoints. The checkpoints did not help, so I created the following toy application to test whether my checkpoints were working correctly. All I do in this example is iteratively create columns by copying the original column over and over again. I persist, checkpoint and count every 10 iterations. I notice that my dataframe.rdd.isCheckpointed() always returns False. I can verify that the checkpoint folders are indeed being created and populated on disk. I am running on
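A point often raised for this symptom: df.rdd builds a fresh, derived RDD on every access, so calling isCheckpointed() on it does not reflect a checkpoint taken earlier. A minimal sketch of the usual pattern, assuming Spark 2.1+ (where DataFrame.checkpoint() is available) and a checkpoint directory on HDFS:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()
    # Checkpoint files need a reliable location (HDFS or GCS on Dataproc).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    df = spark.range(1000).withColumnRenamed("id", "col_0")

    for i in range(1, 51):
        df = df.withColumn("col_%d" % i, F.col("col_0"))
        if i % 10 == 0:
            # checkpoint() returns a NEW DataFrame whose plan starts from the
            # checkpointed data; keep working with the returned object.
            df = df.checkpoint(eager=True)
            df.count()
            # df.rdd.isCheckpointed() may still report False here, because
            # df.rdd wraps the checkpointed data in a newly derived RDD, even
            # though the lineage has been truncated.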

Google Cloud Dataproc Virus CrytalMiner (dr.who)

Submitted by 心不动则不痛 on 2019-12-02 02:57:32
After a Dataproc cluster is created, many jobs are submitted automatically to the ResourceManager by user dr.who. This starves the cluster of resources and eventually overwhelms it. There is little to no information in the logs. Is anyone else experiencing this issue in Dataproc? Without knowing more, here is what I suspect is going on: it sounds like your cluster has been compromised. Your firewall (network) rules are likely open, allowing any traffic into the cluster, and someone has discovered that your cluster is open to the public internet and is taking advantage of it. I recommend

KeyError: 'SPARK_HOME' in pyspark on Jupyter on Google-Cloud-DataProc

Submitted by 无人久伴 on 2019-12-02 02:20:59
Question: When trying to show a Spark DataFrame (Test), I get a KeyError, as shown below. Something probably goes wrong in the function I used before Test.show(3). The KeyError says: KeyError: 'SPARK_HOME'. I assume SPARK_HOME is not defined on the master and/or workers. Is there a way I can specify the SPARK_HOME directory automatically on both? Preferably by using an initialization action.

    Py4JJavaErrorTraceback (most recent call last)
     in ()
    ----> 1 Test.show(3)
    /usr/lib/spark/python/pyspark/sql/dataframe.py
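One workaround pattern, sketched here rather than taken from a documented Dataproc fix: export SPARK_HOME to the worker processes explicitly, either through Spark's executor-environment setting or defensively inside the code that runs on the workers. The /usr/lib/spark path matches the traceback above but should be verified on the cluster:

    import os
    from pyspark.sql import SparkSession

    # Option 1: pass the variable to executor processes via Spark config;
    # spark.executorEnv.<NAME> is a standard Spark setting.
    spark = (SparkSession.builder
             .appName("spark-home-sketch")
             .config("spark.executorEnv.SPARK_HOME", "/usr/lib/spark")
             .getOrCreate())

    # Option 2: set it defensively in code that executes on the workers,
    # before anything there reads os.environ['SPARK_HOME'].
    def my_worker_function(value):
        os.environ.setdefault("SPARK_HOME", "/usr/lib/spark")
        return value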

PySpark print to console

Submitted by 二次信任 on 2019-12-02 01:26:14
When running a PySpark job on the Dataproc server like this

    gcloud --project <project_name> dataproc jobs submit pyspark --cluster <cluster_name> <python_script>

my print statements don't show up in my terminal. Is there any way to output data to the terminal in PySpark when running jobs on the cloud? Edit: I would like to print/log info from within my transformation. For example:

    def print_funct(l):
        print(l)
        return l

    rddData.map(lambda l: print_funct(l)).collect()

should print every line of data in the RDD rddData. Doing some digging, I found this answer for logging; however, testing it
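If the goal is to log from inside the transformation itself, one common sketch uses Python's standard logging module in the function shipped to the executors; the records then land in the executor (YARN container) logs, which on Dataproc are reachable through the YARN/Spark web UIs rather than the submitting terminal. The logger name is arbitrary and rddData is assumed to be the RDD from the question:

    import logging

    def log_and_pass(l):
        # Runs on the executors; configure logging lazily per worker process.
        logger = logging.getLogger("my_transform")
        if not logger.handlers:
            handler = logging.StreamHandler()  # writes to executor stderr
            handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
            logger.addHandler(handler)
            logger.setLevel(logging.INFO)
        logger.info("processing element: %s", l)
        return l

    # The log lines appear in the executor logs, not in the terminal that
    # ran `gcloud dataproc jobs submit`.
    result = rddData.map(log_and_pass).collect()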

StackOverflowError when applying PySpark ALS's “recommendProductsForUsers” (although a cluster with >300 GB RAM is available)

Submitted by 心不动则不痛 on 2019-12-01 23:39:32
Looking for expertise to guide me on the issue below. Background: I'm trying to get going with a basic PySpark script inspired by this example. As deployment infrastructure I use a Google Cloud Dataproc cluster. The cornerstone of my code is the function "recommendProductsForUsers" documented here, which gives me back the top X products for all users in the model. The issue I run into: the ALS.train script runs smoothly and scales well on GCP (easily >1M customers). However, applying the predictions, i.e. using the functions 'predictAll' or 'recommendProductsForUsers', does not scale at all. My script runs smoothly for a
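For context, a minimal sketch of the pattern being described, using pyspark.mllib's ALS on tiny synthetic ratings (rank, iteration count and the top-N value are placeholders). recommendProductsForUsers ranks products for every user block-wise, so it is far heavier than training and is typically where memory and partitioning limits show up:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-sketch")

    # Tiny synthetic ratings: (user, product, rating).
    ratings = sc.parallelize([
        Rating(1, 10, 5.0), Rating(1, 20, 3.0),
        Rating(2, 10, 4.0), Rating(2, 30, 1.0),
        Rating(3, 20, 2.0), Rating(3, 30, 5.0),
    ])

    # Training usually scales well.
    model = ALS.train(ratings, rank=10, iterations=5)

    # This step builds a top-N ranking over (user, product) factor blocks,
    # which is much more expensive than training on large models.
    top3 = model.recommendProductsForUsers(3)
    print(top3.collect())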

Running app jar file on spark-submit in a google dataproc cluster instance

Submitted by 谁都会走 on 2019-12-01 20:53:51
I'm running a .jar file that contains all the dependencies I need packaged in it. One of these dependencies is com.google.common.util.concurrent.RateLimiter, and I have already checked that its class file is in this .jar file. Unfortunately, when I run spark-submit on the master node of my Google Dataproc cluster instance, I get this error:

    Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
        at com.google.common.util.concurrent.RateLimiter$SleepingStopwatch$1.<init>(RateLimiter.java:417)
        at com.google

Running app jar file on spark-submit in a google dataproc cluster instance

Submitted by 旧城冷巷雨未停 on 2019-12-01 20:34:31
Question: I'm running a .jar file that contains all the dependencies I need packaged in it. One of these dependencies is com.google.common.util.concurrent.RateLimiter, and I have already checked that its class file is in this .jar file. Unfortunately, when I run spark-submit on the master node of my Google Dataproc cluster instance, I get this error:

    Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
        at com