google-cloud-dataproc

Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc -> Jobs in the console. There are two reasons I would like to have the logs in Cloud Logging as well: (1) I'd like to see the logs from the executors - often the master log just says "executor lost" with no further detail, and it would be very useful to have more information about what the executor is up to; (2) Cloud Logging has nice filtering and search. Currently the only output from …
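
If the driver and executor output is being routed into Cloud Logging (for example through the cluster's logging agent), it can then be filtered programmatically with the Cloud Logging Python client. The sketch below is only illustrative: the project id, cluster name, and the resource type/label names are assumptions and should be adjusted to whatever the Logs Explorer actually shows for your cluster.

    from google.cloud import logging as cloud_logging

    # Placeholder project id; the resource type and labels below are assumed,
    # not verified against a live Dataproc cluster.
    client = cloud_logging.Client(project="my-project")

    log_filter = (
        'resource.type="cloud_dataproc_cluster" '
        'AND resource.labels.cluster_name="my-cluster" '
        'AND textPayload:"executor"'
    )

    for entry in client.list_entries(filter_=log_filter, page_size=100):
        print(entry.timestamp, entry.payload)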

Flink checkpoints to Google Cloud Storage

I am trying to configure checkpoints for Flink jobs in GCS. Everything works fine if I run a test job locally (no Docker, no cluster setup), but it fails with an error if I run it via docker-compose or a cluster setup and deploy the fat jar with the jobs through the Flink dashboard. Any thoughts on it? Thanks! Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded. at org.apache.flink.core.fs.FileSystem …
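
The exception means Flink cannot load a Hadoop FileSystem implementation for the gs:// scheme inside the container. A commonly used remedy (a sketch only, not verified against this exact docker-compose setup) is to ship the shaded gcs-connector jar, together with a Hadoop-compatible Flink build, in the image's lib/ directory and point Flink at a Hadoop configuration that registers the GCS classes, roughly like this:

    <!-- core-site.xml, made visible to Flink e.g. via fs.hdfs.hadoopconf in
         flink-conf.yaml; property names come from the gcs-connector docs -->
    <configuration>
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
    </configuration>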

Can't add jars pyspark in jupyter of Google DataProc

I have a Jupyter notebook on Dataproc and I need a jar to run some job. I'm aware of editing spark-defaults.conf and of using --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar to submit the job from the command line - they both work well. However, if I want to add the jar directly in the Jupyter notebook, the methods below all fail.
Method 1:
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar pyspark-shell'
Method 2:
    spark = SparkSession.builder.appName('Shakespeare WordCount')\
        .config('spark.jars', 'gs://spark-lib/bigquery …
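
A likely reason both methods fail inside the notebook is that spark.jars (and PYSPARK_SUBMIT_ARGS) are only honored when the JVM-side context is launched, and the Dataproc Jupyter kernel usually starts a SparkContext before any cell runs. Below is a minimal sketch of one workaround, assuming it is acceptable to stop that pre-existing context; depending on how the jar is used this may still not be enough, in which case spark-defaults.conf or --jars at submit time remains the reliable route:

    from pyspark.sql import SparkSession

    # Stop the context the kernel created, if any, so the new configuration
    # is applied when the context is recreated.
    try:
        spark.stop()
    except NameError:
        pass

    spark = (
        SparkSession.builder
        .appName('Shakespeare WordCount')
        .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar')
        .getOrCreate()
    )

    # Quick sanity check that the setting took effect.
    print(spark.sparkContext.getConf().get('spark.jars'))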

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of manually logging into the master and/or worker nodes to install the libraries I need. It would also be great to know whether this automated installation could install things only on the master and not on the workers. Initialization actions are the best way to do this. Initialization actions are shell scripts which are run when the cluster is created; they let you customize the cluster, such as installing Python libraries. These scripts must be stored in …
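
For illustration, here is a minimal initialization-action sketch (the package list, script name and bucket are hypothetical). Dataproc exposes the node's role through instance metadata, which is what makes master-only installs possible:

    #!/bin/bash
    set -euxo pipefail

    # "Master" vs "Worker" is read from Dataproc's instance metadata.
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)

    # Installed on every node.
    pip install --upgrade numpy pandas

    # Installed on the master only.
    if [[ "${ROLE}" == "Master" ]]; then
      pip install --upgrade jupyter
    fi

The script is then uploaded to a GCS bucket and referenced at cluster creation, e.g. gcloud dataproc clusters create ... --initialization-actions gs://my-bucket/install-libs.sh (paths hypothetical).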

Guava version while using spark-shell

I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc, however I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath includes a much older Guava version from somewhere else, even when I specify the proper version on startup. I suspect this is likely caused by all the Hadoop dependencies put on the classpath by default. Is there any way to have spark-shell use only the proper version of Guava, without getting rid of all the Hadoop-related jars that Dataproc includes? Relevant data: starting spark-shell, showing it having the …
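
Short of removing the Dataproc-provided jars, one commonly suggested workaround is to prepend a newer Guava (and the connector) to the driver and executor classpaths, since spark.driver.extraClassPath and spark.executor.extraClassPath are prepended rather than appended. This is a sketch with hypothetical jar paths; whether it resolves the clash depends on what else on the cluster still expects the old Guava:

    spark-shell \
      --jars /usr/local/lib/spark-cassandra-connector-assembly.jar \
      --conf spark.driver.extraClassPath=/usr/local/lib/guava-19.0.jar \
      --conf spark.executor.extraClassPath=/usr/local/lib/guava-19.0.jar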

Google Cloud Dataproc configuration issues

I've been encountering various issues in some Spark LDA topic modeling I've been running (mainly disassociation errors at seemingly random intervals), which I think mainly have to do with insufficient memory allocation on my executors. This seems to be related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 cores total). But when I look at /etc/spark/conf/spark …
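
Where the automatic sizing is the suspect, one option is to pin the executor layout explicitly when building the session instead of relying on what Dataproc writes into /etc/spark/conf. The values below are placeholders for illustration, not tuned recommendations:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Illustrative sizing for n1-standard-8 workers: leave headroom for YARN
    # and memory overhead instead of claiming all 30GB per node.
    conf = (
        SparkConf()
        .set("spark.executor.instances", "6")
        .set("spark.executor.cores", "7")
        .set("spark.executor.memory", "21g")
    )
    spark = SparkSession.builder.config(conf=conf).getOrCreate()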

GCP Dataproc - configure YARN fair scheduler

I was trying to set up a Dataproc cluster that would compute only one job (or a specified maximum number of jobs) at a time and keep the rest in a queue. I have found this solution, How to configure monopolistic FIFO application queue in YARN?, but as I'm always creating a new cluster, I needed to automate this. I have added this to cluster creation:
    "softwareConfig": {
      "properties": {
        "yarn:yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
        "yarn:yarn.scheduler.fair.user-as-default-queue": "false",
        "yarn:yarn.scheduler.fair.allocation.file" …
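
For completeness, the same properties can also be passed through the gcloud CLI at creation time instead of the raw softwareConfig block; a rough sketch (cluster name and any further scheduler properties are placeholders):

    gcloud dataproc clusters create my-cluster \
      --properties "yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler,yarn:yarn.scheduler.fair.user-as-default-queue=false"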

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output Format (https://cloud.google.com/hadoop/writing-with-bigquery-connector) and pyspark should be able to use this interface in order to create an RDD by using the method "newAPIHadoopRDD" (http://spark.apache.org/docs/latest/api/python/pyspark.html). Unfortunately, the documentation on both ends seems scarce and goes …
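
A rough sketch of that pattern, based on the Hadoop BigQuery connector documentation; the project, bucket, temporary GCS path, and input table are placeholders, and the key names should be double-checked against the connector version installed on the cluster:

    from pyspark import SparkContext

    sc = SparkContext()  # or reuse the shell's existing sc

    # Connector configuration (keys as documented for the Hadoop BigQuery connector).
    conf = {
        "mapred.bq.project.id": "my-project",                       # billing project
        "mapred.bq.gcs.bucket": "my-bucket",                        # staging bucket
        "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq-export",  # export location
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record comes back with a JSON string value that can be parsed downstream.
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )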