google-cloud-dataproc

Move data from hive tables in Google Dataproc to BigQuery

Question: We are doing data transformations using Google Dataproc, and all our data resides in Dataproc Hive tables. How do I transfer/move this data to BigQuery?

Answer 1: Transferring from Hive to BigQuery follows a standard pattern:

1. Dump your Hive tables into Avro files.
2. Load those files into BigQuery.

See an example here: Migrate hive table to Google BigQuery. As mentioned above, take care about type compatibility between Hive, Avro, and BigQuery. And for the first time I guess it would not hurt to do some…
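A minimal sketch of the load step with the BigQuery Python client, assuming the Avro files exported from Hive already sit in a GCS bucket; the project, bucket, dataset, and table names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load every Avro file the Hive export wrote under this prefix.
load_job = client.load_table_from_uri(
    "gs://my-bucket/hive-export/invoices/*.avro",
    "my_dataset.invoices",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(load_job.output_rows, "rows loaded")

Avro files are self-describing, so BigQuery derives the table schema from them, which is also where the Hive/Avro/BigQuery type-compatibility caveat above comes into play.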

Spark UI available on Dataproc Cluster?

Question: Looking to interact with the traditional Spark web UI on default clusters in Dataproc.

Answer 1: This can be done by creating an SSH tunnel to the Dataproc master node. By using a SOCKS proxy, you can then access all the applications running on YARN, including your Spark sessions. This guide will walk you through it in detail: Dataproc Cluster web interfaces.

Source: https://stackoverflow.com/questions/44248567/spark-ui-available-on-dataproc-cluster
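For reference, a sketch of the tunnel setup wrapped in Python (the cluster name, zone, and SOCKS port are placeholders, and it assumes an installed, authenticated gcloud CLI):

import subprocess

CLUSTER_MASTER = "my-cluster-m"   # hypothetical master node name (cluster name + "-m")
ZONE = "us-west1-b"               # hypothetical zone

# Open an SSH tunnel to the master node that exposes a SOCKS proxy on localhost:1080.
# Everything after "--" is passed straight to ssh: -D starts dynamic port forwarding,
# -N keeps the session open without running a remote command.
tunnel = subprocess.Popen([
    "gcloud", "compute", "ssh", CLUSTER_MASTER,
    f"--zone={ZONE}",
    "--", "-D", "1080", "-N",
])

# With the tunnel up, start a browser that routes through the proxy, e.g. Chrome with
# --proxy-server="socks5://localhost:1080", and open http://<master-hostname>:8088
# to reach the YARN ResourceManager and, from there, the running Spark UIs.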

How can I load data that can't be pickled in each Spark executor?

Question: I'm using the NoAho library, which is written in Cython. Its internal trie cannot be pickled: if I load it on the master node, I never get matches for operations that execute in workers. Since I would like to use the same trie in each Spark executor, I found a way to load the trie lazily, inspired by this spaCy on Spark issue.

global trie

def get_match(text):
    # 1. Load trie if needed
    global trie
    try:
        trie
    except NameError:
        from noaho import NoAho
        trie = NoAho()
        trie.add(key_text='ms windows', …
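The general shape of that workaround, sketched below with a plain dictionary standing in for the NoAho trie so the example stays self-contained: the object is built at most once per executor process, the first time a task on that executor needs it, and it never has to travel through pickle.

from pyspark.sql import SparkSession

_matcher = None  # per-executor cache; lives in the worker process, never pickled

def _get_matcher():
    # Build the unpicklable object lazily, once per executor process.
    global _matcher
    if _matcher is None:
        # Hypothetical expensive construction (stand-in for NoAho() + trie.add calls).
        _matcher = {"ms windows": "os"}
    return _matcher

def match_partition(rows):
    matcher = _get_matcher()   # the first call on each executor builds the object
    for row in rows:
        yield (row.text, matcher.get(row.text.lower()))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("MS Windows",), ("linux",)], ["text"])
print(df.rdd.mapPartitions(match_partition).collect())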

Consume GCS files based on pattern from Flink

Question: Flink supports the Hadoop FileSystem abstraction, and there is a GCS connector, a library that implements it on top of Google Cloud Storage. How do I create a Flink file source using the code in this repo?

Answer 1: To achieve this you need to:

1. Install and configure the GCS connector on your Flink cluster.
2. Add the Hadoop and Flink dependencies (including the HDFS connector) to your project:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>${flink.version}…

Issue querying a Hive table in Datalab

Question: I have created a Dataproc cluster with an updated init action to install Datalab. All works fine, except that when I query a Hive table from the Datalab notebook,

hc.sql("""select * from invoices limit 10""")

I run into a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.

Create cluster:

gcloud beta dataproc clusters create ds-cluster \
    --project my-exercise-project \
    --region us-west1 \
    --zone us-west1-b \
    --bucket dataproc-datalab…
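For context, this kind of ClassNotFoundException usually means the Spark session behind hc cannot see the GCS connector. Below is a sketch of one way to wire it in from the notebook, not necessarily the accepted fix from the thread; the connector jar path is an assumption about the Dataproc image layout:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datalab-hive")
    # Typical location of the GCS connector on Dataproc images (assumption).
    .config("spark.jars", "/usr/lib/hadoop/lib/gcs-connector.jar")
    # Register the gs:// FileSystem implementations explicitly.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("select * from invoices limit 10").show()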

Creating a cluster before sending a job to dataproc programmatically

Question: I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a little Python script to App Engine which does the following:

1. Authenticate using a service account.
2. Submit a job to a cluster.

The problem is, I need the cluster to be up and running, otherwise the job won't be sent (duh!), but I don't want the cluster to always be up and running, especially since my job needs to run once a month. I wanted to add the creation of a cluster in my Python script, but the…
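A sketch of the cluster-creation step with the google-cloud-dataproc Python client (the request-style API of the current library; project, region, cluster name, and machine sizes are placeholders):

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder
CLUSTER = "monthly-job-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": CLUSTER,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; block until the cluster is ready,
# then submit the PySpark job and tear the cluster down again.
cluster_client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# ... submit the job with dataproc_v1.JobControllerClient ...

cluster_client.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
).result()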

Pyspark application only partly exploits dataproc cluster resources

Question: My PySpark application runs a UDF over a 106.36 MB dataset (817,270 records), which takes about 100 hours with regular Python lambda functions. I have spawned a Google Dataproc cluster with 20 worker nodes with 8 vCPUs each. However, upon execution only 3 nodes and 3 vCPUs in total are used. Obviously, I would like the cluster to use all the resources that I make available. The default number of partitions of my resulting dataframe is 8. I tried repartitioning it to 100, but the cluster keeps…
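One common shape of the fix, sketched below rather than taken from the thread: disable dynamic allocation (or raise its ceiling), request enough executors for the cluster, and repartition before the expensive UDF runs so every core gets tasks; the column and path names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = (
    SparkSession.builder
    .appName("udf-parallelism")
    # Sized for 20 workers x 8 vCPUs; exact values depend on YARN memory per node.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "40")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

expensive = udf(lambda s: s.upper() if s else s, StringType())  # stand-in for the real UDF

df = spark.read.parquet("gs://my-bucket/input/")   # placeholder input
# Repartition before applying the UDF: 160 partitions = 40 executors x 4 cores.
result = df.repartition(160).withColumn("out", expensive("text_column"))
result.write.mode("overwrite").parquet("gs://my-bucket/output/")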

Google DataProc API spark cluster with c#

Question: I have data in BigQuery that I want to run analytics on in a Spark cluster. Per the documentation, if I instantiate a Spark cluster it should come with a BigQuery connector. I was looking for sample code to do this and found one in PySpark, but could not find any C# examples. I also found some documentation on the functions in the Dataproc API's NuGet package. Looking for a sample to start a Spark cluster in Google Cloud using C#.

Answer 1: After installing Google.Apis.Dataproc.v1 version 1.10.0.40 (or higher):…

How to use params/properties flag values when executing hive job on google dataproc

Question: I am trying to execute a Hive job on Google Dataproc using the following gcloud commands:

gcloud dataproc jobs submit hive --cluster=msm-test-cluster --file hive.sql --properties=[bucket1=abcd]
gcloud dataproc jobs submit hive --cluster=msm-test-cluster --file hive.sql --params=[bucket1=abcd]

But neither of the two commands above is able to pass the 'bucket1' value into the 'x' variable. The Hive script is as follows:

set x=${bucket1};
set x;
drop table T1;
create external table T1( column1 bigint, column2…
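For comparison, a sketch of passing the variable through the Dataproc API's HiveJob scriptVariables field (shown with the Python client; the project, region, and the GCS copy of hive.sql are placeholders), which Dataproc applies as SET name=value before the script runs:

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "msm-test-cluster"},
    "hive_job": {
        "query_file_uri": "gs://my-bucket/hive.sql",   # hive.sql uploaded to GCS
        "script_variables": {"bucket1": "abcd"},       # applied as SET bucket1=abcd
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT, "region": REGION, "job": job}
)
print(operation.result().driver_output_resource_uri)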

How do I dynamically upgrade workers' CPU/RAM/disk in Dataproc?

Question: I created a cluster with the default settings (4 vCPUs, 15 GB RAM) in Google Dataproc. After running several Pig jobs, the cluster had 2-3 unhealthy nodes, so I upgraded the worker VMs' vCPUs (4 to 8), RAM (15 GB to 30 GB), and disk. But the Hadoop web interface showed that the worker nodes' hardware didn't change; it still showed the original amounts of vCPU/RAM/disk. How can I dynamically upgrade a worker's CPU/RAM/disk in Dataproc? Thanks.

Answer 1: Dataproc has no support for upgrading workers on running…
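What the clusters API does support is changing the number of primary or secondary workers on a running cluster. A sketch with the Python client, offered as an alternative to resizing the VMs themselves (project, region, and cluster name are placeholders):

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-west1"      # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Scale the primary worker group to 5 nodes; the machine type, RAM, and disk of the
# existing workers cannot be changed in place through this API.
operation = cluster_client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": "my-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 5}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()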