google-cloud-dataproc

Move data from hive tables in Google Dataproc to BigQuery

Question: We are doing data transformations using Google Dataproc, and all our data resides in Dataproc Hive tables. How do I transfer/move this data to BigQuery?

Answer 1: Transferring from Hive to BigQuery follows a standard pattern:

1. Dump your Hive tables into Avro files.
2. Load those files into BigQuery.

See an example here: Migrate hive table to Google BigQuery. As mentioned above, take care about type compatibility between Hive, Avro, and BigQuery. And for the first time I guess it would not hurt to do some…
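A minimal sketch of the load step with the BigQuery Python client, assuming the Avro files exported from Hive already sit in a GCS bucket; the project, bucket, dataset, and table names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load every Avro file the Hive export wrote under this prefix.
load_job = client.load_table_from_uri(
    "gs://my-bucket/hive-export/invoices/*.avro",
    "my_dataset.invoices",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(load_job.output_rows, "rows loaded")

Avro files are self-describing, so BigQuery derives the table schema from them, which is also where the Hive/Avro/BigQuery type-compatibility caveat above comes into play.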

Spark UI available on Dataproc Cluster?

Question: Looking to interact with the traditional Spark web UI on default clusters in Dataproc.

Answer 1: This can be done by creating an SSH tunnel to the Dataproc master node. By using a SOCKS proxy, you can then access all the applications running on YARN, including your Spark sessions. This guide will walk you through it in detail: Dataproc Cluster web interfaces.

Source: https://stackoverflow.com/questions/44248567/spark-ui-available-on-dataproc-cluster
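For reference, a sketch of the tunnel setup wrapped in Python (the cluster name, zone, and SOCKS port are placeholders, and it assumes an installed, authenticated gcloud CLI):

import subprocess

CLUSTER_MASTER = "my-cluster-m"   # hypothetical master node name (cluster name + "-m")
ZONE = "us-west1-b"               # hypothetical zone

# Open an SSH tunnel to the master node that exposes a SOCKS proxy on localhost:1080.
# Everything after "--" is passed straight to ssh: -D starts dynamic port forwarding,
# -N keeps the session open without running a remote command.
tunnel = subprocess.Popen([
    "gcloud", "compute", "ssh", CLUSTER_MASTER,
    f"--zone={ZONE}",
    "--", "-D", "1080", "-N",
])

# With the tunnel up, start a browser that routes through the proxy, e.g. Chrome with
# --proxy-server="socks5://localhost:1080", and open http://<master-hostname>:8088
# to reach the YARN ResourceManager and, from there, the running Spark UIs.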

How can I load data that can't be pickled in each Spark executor?

Question: I'm using the NoAho library, which is written in Cython. Its internal trie cannot be pickled: if I load it on the master node, I never get matches for operations that execute in workers. Since I would like to use the same trie in each Spark executor, I found a way to load the trie lazily, inspired by this spaCy on Spark issue.

global trie

def get_match(text):
    # 1. Load trie if needed
    global trie
    try:
        trie
    except NameError:
        from noaho import NoAho
        trie = NoAho()
        trie.add(key_text='ms windows', …
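The general shape of that workaround, sketched below with a plain dictionary standing in for the NoAho trie so the example stays self-contained: the object is built at most once per executor process, the first time a task on that executor needs it, and it never has to travel through pickle.

from pyspark.sql import SparkSession

_matcher = None  # per-executor cache; lives in the worker process, never pickled

def _get_matcher():
    # Build the unpicklable object lazily, once per executor process.
    global _matcher
    if _matcher is None:
        # Hypothetical expensive construction (stand-in for NoAho() + trie.add calls).
        _matcher = {"ms windows": "os"}
    return _matcher

def match_partition(rows):
    matcher = _get_matcher()   # the first call on each executor builds the object
    for row in rows:
        yield (row.text, matcher.get(row.text.lower()))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("MS Windows",), ("linux",)], ["text"])
print(df.rdd.mapPartitions(match_partition).collect())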

Consume GCS files based on pattern from Flink

Question: Flink supports the Hadoop FileSystem abstraction, and there is a GCS connector, a library that implements it on top of Google Cloud Storage. How do I create a Flink file source using the code in this repo?

Answer 1: To achieve this you need to:

1. Install and configure the GCS connector on your Flink cluster.
2. Add the Hadoop and Flink dependencies (including the HDFS connector) to your project:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>${flink.version}…

Issue querying a Hive table in Datalab

Question: I have created a Dataproc cluster with an updated init action to install Datalab. All works fine, except that when I query a Hive table from the Datalab notebook,

hc.sql("""select * from invoices limit 10""")

I run into a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.

Create cluster:

gcloud beta dataproc clusters create ds-cluster \
    --project my-exercise-project \
    --region us-west1 \
    --zone us-west1-b \
    --bucket dataproc-datalab…
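For context, this kind of ClassNotFoundException usually means the Spark session behind hc cannot see the GCS connector. Below is a sketch of one way to wire it in from the notebook, not necessarily the accepted fix from the thread; the connector jar path is an assumption about the Dataproc image layout:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datalab-hive")
    # Typical location of the GCS connector on Dataproc images (assumption).
    .config("spark.jars", "/usr/lib/hadoop/lib/gcs-connector.jar")
    # Register the gs:// FileSystem implementations explicitly.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("select * from invoices limit 10").show()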

Creating a cluster before sending a job to dataproc programmatically

Question: I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a little Python script to App Engine which does the following:

1. Authenticate using a service account.
2. Submit a job to a cluster.

The problem is, I need the cluster to be up and running, otherwise the job won't be sent (duh!), but I don't want the cluster to always be up and running, especially since my job needs to run once a month. I wanted to add the creation of a cluster in my Python script, but the…
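A sketch of the cluster-creation step with the google-cloud-dataproc Python client (the request-style API of the current library; project, region, cluster name, and machine sizes are placeholders):

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder
CLUSTER = "monthly-job-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": CLUSTER,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; block until the cluster is ready,
# then submit the PySpark job and tear the cluster down again.
cluster_client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# ... submit the job with dataproc_v1.JobControllerClient ...

cluster_client.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
).result()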

Pyspark application only partly exploits dataproc cluster resources

Question: My PySpark application runs a UDF over a 106.36 MB dataset (817,270 records), which takes about 100 hours with regular Python lambda functions. I have spawned a Google Dataproc cluster with 20 worker nodes with 8 vCPUs each. However, upon execution only 3 nodes and 3 vCPUs in total are used. Obviously, I would like the cluster to use all the resources that I make available. The default number of partitions of my resulting dataframe is 8. I tried repartitioning it to 100, but the cluster keeps…
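One common shape of the fix, sketched below rather than taken from the thread: disable dynamic allocation (or raise its ceiling), request enough executors for the cluster, and repartition before the expensive UDF runs so every core gets tasks; the column and path names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = (
    SparkSession.builder
    .appName("udf-parallelism")
    # Sized for 20 workers x 8 vCPUs; exact values depend on YARN memory per node.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "40")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

expensive = udf(lambda s: s.upper() if s else s, StringType())  # stand-in for the real UDF

df = spark.read.parquet("gs://my-bucket/input/")   # placeholder input
# Repartition before applying the UDF: 160 partitions = 40 executors x 4 cores.
result = df.repartition(160).withColumn("out", expensive("text_column"))
result.write.mode("overwrite").parquet("gs://my-bucket/output/")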

Google DataProc API spark cluster with c#

Question: I have data in BigQuery that I want to run analytics on in a Spark cluster. Per the documentation, if I instantiate a Spark cluster it should come with a BigQuery connector. I was looking for sample code to do this and found one in PySpark, but could not find any C# examples. I also found some documentation on the functions in the Dataproc API's NuGet package. Looking for a sample to start a Spark cluster in Google Cloud using C#.

Answer 1: After installing Google.Apis.Dataproc.v1 version 1.10.0.40 (or higher):…

How to use params/properties flag values when executing hive job on google dataproc

Question: I am trying to execute a Hive job on Google Dataproc using the following gcloud commands:

gcloud dataproc jobs submit hive --cluster=msm-test-cluster --file hive.sql --properties=[bucket1=abcd]
gcloud dataproc jobs submit hive --cluster=msm-test-cluster --file hive.sql --params=[bucket1=abcd]

But neither of the two commands above is able to pass the 'bucket1' value into the 'x' variable. The Hive script is as follows:

set x=${bucket1};
set x;
drop table T1;
create external table T1( column1 bigint, column2…
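For comparison, a sketch of passing the variable through the Dataproc API's HiveJob scriptVariables field (shown with the Python client; the project, region, and the GCS copy of hive.sql are placeholders), which Dataproc applies as SET name=value before the script runs:

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "msm-test-cluster"},
    "hive_job": {
        "query_file_uri": "gs://my-bucket/hive.sql",   # hive.sql uploaded to GCS
        "script_variables": {"bucket1": "abcd"},       # applied as SET bucket1=abcd
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT, "region": REGION, "job": job}
)
print(operation.result().driver_output_resource_uri)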

How do I dynamically upgrade workers' CPU/RAM/disk in Dataproc?

Question: I created a cluster with the default settings (4 vCPUs, 15 GB RAM) in Google Dataproc. After running several Pig jobs, the cluster had 2-3 unhealthy nodes, so I upgraded the worker VMs' vCPUs (4 to 8), RAM (15 GB to 30 GB), and disk. But the Hadoop web interface showed that the worker nodes' hardware didn't change; it still showed the original amounts of vCPU/RAM/disk. How can I dynamically upgrade a worker's CPU/RAM/disk in Dataproc? Thanks.

Answer 1: Dataproc has no support for upgrading workers on running…
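What the clusters API does support is changing the number of primary or secondary workers on a running cluster. A sketch with the Python client, offered as an alternative to resizing the VMs themselves (project, region, and cluster name are placeholders):

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Scale the primary worker group to 5 nodes; the machine type, RAM, and disk of the
# existing workers cannot be changed in place through this API.
operation = cluster_client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": "my-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 5}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()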