google-cloud-dataproc

Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc -> Jobs in the console. There are two reasons I would like to have the logs in Cloud Logging as well: (1) I'd like to see the logs from the executors - often the master log just says "executor lost" with no further detail, and it would be very useful to have more information about what the executor is up to; (2) Cloud Logging has nice filtering and search. Currently the only output from …
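
If the driver and executor output is being routed into Cloud Logging (for example through the cluster's logging agent), it can then be filtered programmatically with the Cloud Logging Python client. The sketch below is only illustrative: the project id, cluster name, and the resource type/label names are assumptions and should be adjusted to whatever the Logs Explorer actually shows for your cluster.

    from google.cloud import logging as cloud_logging

    # Placeholder project id; the resource type and labels below are assumed,
    # not verified against a live Dataproc cluster.
    client = cloud_logging.Client(project="my-project")

    log_filter = (
        'resource.type="cloud_dataproc_cluster" '
        'AND resource.labels.cluster_name="my-cluster" '
        'AND textPayload:"executor"'
    )

    for entry in client.list_entries(filter_=log_filter, page_size=100):
        print(entry.timestamp, entry.payload)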

Flink checkpoints to Google Cloud Storage

I am trying to configure checkpoints for Flink jobs in GCS. Everything works fine if I run a test job locally (no Docker, no cluster setup), but it fails with an error if I run it via docker-compose or a cluster setup and deploy the fat jar with the jobs through the Flink dashboard. Any thoughts on it? Thanks! Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded. at org.apache.flink.core.fs.FileSystem …
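
The exception means Flink cannot load a Hadoop FileSystem implementation for the gs:// scheme inside the container. A commonly used remedy (a sketch only, not verified against this exact docker-compose setup) is to ship the shaded gcs-connector jar, together with a Hadoop-compatible Flink build, in the image's lib/ directory and point Flink at a Hadoop configuration that registers the GCS classes, roughly like this:

    <!-- core-site.xml, made visible to Flink e.g. via fs.hdfs.hadoopconf in
         flink-conf.yaml; property names come from the gcs-connector docs -->
    <configuration>
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
    </configuration>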

Can't add jars pyspark in jupyter of Google DataProc

I have a Jupyter notebook on Dataproc and I need a jar to run some job. I'm aware of editing spark-defaults.conf and of using --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar to submit the job from the command line - they both work well. However, if I want to add the jar directly in the Jupyter notebook, the methods below all fail.
Method 1:
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar pyspark-shell'
Method 2:
    spark = SparkSession.builder.appName('Shakespeare WordCount')\
        .config('spark.jars', 'gs://spark-lib/bigquery …
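
A likely reason both methods fail inside the notebook is that spark.jars (and PYSPARK_SUBMIT_ARGS) are only honored when the JVM-side context is launched, and the Dataproc Jupyter kernel usually starts a SparkContext before any cell runs. Below is a minimal sketch of one workaround, assuming it is acceptable to stop that pre-existing context; depending on how the jar is used this may still not be enough, in which case spark-defaults.conf or --jars at submit time remains the reliable route:

    from pyspark.sql import SparkSession

    # Stop the context the kernel created, if any, so the new configuration
    # is applied when the context is recreated.
    try:
        spark.stop()
    except NameError:
        pass

    spark = (
        SparkSession.builder
        .appName('Shakespeare WordCount')
        .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar')
        .getOrCreate()
    )

    # Quick sanity check that the setting took effect.
    print(spark.sparkContext.getConf().get('spark.jars'))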

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of manually logging into the master and/or worker nodes to install the libraries I need. It would also be great to know whether this automated installation could install things only on the master and not on the workers. Initialization actions are the best way to do this. Initialization actions are shell scripts which are run when the cluster is created; they let you customize the cluster, such as installing Python libraries. These scripts must be stored in …
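
For illustration, here is a minimal initialization-action sketch (the package list, script name and bucket are hypothetical). Dataproc exposes the node's role through instance metadata, which is what makes master-only installs possible:

    #!/bin/bash
    set -euxo pipefail

    # "Master" vs "Worker" is read from Dataproc's instance metadata.
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)

    # Installed on every node.
    pip install --upgrade numpy pandas

    # Installed on the master only.
    if [[ "${ROLE}" == "Master" ]]; then
      pip install --upgrade jupyter
    fi

The script is then uploaded to a GCS bucket and referenced at cluster creation, e.g. gcloud dataproc clusters create ... --initialization-actions gs://my-bucket/install-libs.sh (paths hypothetical).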

Guava version while using spark-shell

I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc, however I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath includes a much older Guava version from somewhere else, even when I specify the proper version on startup. I suspect this is likely caused by all the Hadoop dependencies put on the classpath by default. Is there any way to have spark-shell use only the proper version of Guava, without getting rid of all the Hadoop-related jars that Dataproc includes? Relevant data: starting spark-shell, showing it having the …
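
Short of removing the Dataproc-provided jars, one commonly suggested workaround is to prepend a newer Guava (and the connector) to the driver and executor classpaths, since spark.driver.extraClassPath and spark.executor.extraClassPath are prepended rather than appended. This is a sketch with hypothetical jar paths; whether it resolves the clash depends on what else on the cluster still expects the old Guava:

    spark-shell \
      --jars /usr/local/lib/spark-cassandra-connector-assembly.jar \
      --conf spark.driver.extraClassPath=/usr/local/lib/guava-19.0.jar \
      --conf spark.executor.extraClassPath=/usr/local/lib/guava-19.0.jar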

Google Cloud Dataproc configuration issues

I've been encountering various issues in some Spark LDA topic modeling I've been running (mainly disassociation errors at seemingly random intervals), which I think mainly have to do with insufficient memory allocation on my executors. This seems to be related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 cores total). But when I look at /etc/spark/conf/spark …
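
Where the automatic sizing is the suspect, one option is to pin the executor layout explicitly when building the session instead of relying on what Dataproc writes into /etc/spark/conf. The values below are placeholders for illustration, not tuned recommendations:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Illustrative sizing for n1-standard-8 workers: leave headroom for YARN
    # and memory overhead instead of claiming all 30GB per node.
    conf = (
        SparkConf()
        .set("spark.executor.instances", "6")
        .set("spark.executor.cores", "7")
        .set("spark.executor.memory", "21g")
    )
    spark = SparkSession.builder.config(conf=conf).getOrCreate()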

GCP Dataproc - configure YARN fair scheduler

I was trying to set up a Dataproc cluster that would compute only one job (or a specified maximum number of jobs) at a time and keep the rest in a queue. I have found this solution, How to configure monopolistic FIFO application queue in YARN?, but as I'm always creating a new cluster, I needed to automate this. I have added this to cluster creation:
    "softwareConfig": {
      "properties": {
        "yarn:yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
        "yarn:yarn.scheduler.fair.user-as-default-queue": "false",
        "yarn:yarn.scheduler.fair.allocation.file" …
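
For completeness, the same properties can also be passed through the gcloud CLI at creation time instead of the raw softwareConfig block; a rough sketch (cluster name and any further scheduler properties are placeholders):

    gcloud dataproc clusters create my-cluster \
      --properties "yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler,yarn:yarn.scheduler.fair.user-as-default-queue=false"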

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output Format (https://cloud.google.com/hadoop/writing-with-bigquery-connector) and pyspark should be able to use this interface in order to create an RDD by using the method "newAPIHadoopRDD" (http://spark.apache.org/docs/latest/api/python/pyspark.html). Unfortunately, the documentation on both ends seems scarce and goes …
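
A rough sketch of that pattern, based on the Hadoop BigQuery connector documentation; the project, bucket, temporary GCS path, and input table are placeholders, and the key names should be double-checked against the connector version installed on the cluster:

    from pyspark import SparkContext

    sc = SparkContext()  # or reuse the shell's existing sc

    # Connector configuration (keys as documented for the Hadoop BigQuery connector).
    conf = {
        "mapred.bq.project.id": "my-project",                       # billing project
        "mapred.bq.gcs.bucket": "my-bucket",                        # staging bucket
        "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq-export",  # export location
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record comes back with a JSON string value that can be parsed downstream.
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )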