google-cloud-dataproc

Dataproc + BigQuery examples - any available?

Submitted by 安稳与你 on 2019-11-28 10:08:59
According to the Dataproc docs, it has "native and automatic integrations with BigQuery". I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (via a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" The reason is that we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc. Are there any Dataproc + BigQuery examples available?
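A minimal PySpark sketch of the read/analyze/write round trip, assuming the spark-bigquery connector is available on the cluster; the project, dataset, table, column, and bucket names below are placeholders, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the spark-bigquery connector is on the cluster's classpath,
# e.g. supplied via --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
spark = SparkSession.builder.appName("bq-roundtrip").getOrCreate()

# Read the source table (placeholder names).
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.source_table") \
    .load()

# Stand-in for the "complex statistical model": a simple aggregation.
result = df.groupBy("some_column").agg(F.avg("some_metric").alias("avg_metric"))

# Write the results back to BigQuery; the connector stages data in a GCS bucket.
result.write.format("bigquery") \
    .option("table", "my-project.my_dataset.result_table") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .save()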

Which HBase connector for Spark 2.0 should I use?

Submitted by 余生长醉 on 2019-11-28 09:14:24
Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. Support for Spark 2.0 and the new Dataset API is not clear to me for the connectors I have found: spark-hbase (https://github.com/apache/hbase/tree/master/hbase-spark), spark-hbase-connector (https://github.com/nerdammer/spark-hbase-connector), and hortonworks-spark/shc (https://github.com/hortonworks-spark/shc). The project is written in Scala 2.11 with SBT. Thanks for your help. Update: SHC now seems to work with Spark 2 and the Table API. See https:/
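The question's project is in Scala, but since SHC is exposed as a plain Spark SQL data source it can be driven from any language binding; a hedged illustration is sketched below in PySpark for consistency with the other examples on this page. The catalog, table, and column names are made up, and writing to Bigtable assumes the HBase-compatible Bigtable client configuration is already on the cluster's classpath:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-sketch").getOrCreate()

# Hypothetical catalog mapping DataFrame columns to an HBase/Bigtable table.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},
    "rowkey": "key",
    "columns": {
        "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
        "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
    }
})

df = spark.createDataFrame([("row1", "value1")], ["key", "col1"])

# Write through SHC's data source, then read the table back with the same catalog.
df.write.options(catalog=catalog).format("org.apache.spark.sql.execution.datasources.hbase").save()
readback = spark.read.options(catalog=catalog).format("org.apache.spark.sql.execution.datasources.hbase").load()
readback.show()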

use an external library in pyspark job in a Spark cluster from google-dataproc

Submitted by 雨燕双飞 on 2019-11-28 07:50:45
I have a Spark cluster that I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster, then I entered:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
This launched a pyspark shell, in which I entered:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()
And it worked. My next step is to launch this job from my main machine using the command:
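The sentence above is cut off before the submission command. A hedged sketch of what the submitted script could look like is shown below; my_job.py is a placeholder name, and the gcloud invocation in the comment assumes the spark.jars.packages job property is used to pull in spark-csv at submission time:

# my_job.py -- placeholder name; submitted, for example, with:
#   gcloud dataproc jobs submit pyspark my_job.py --cluster <my-cluster> \
#       --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Same read as in the interactive test, now inside a submitted job.
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load('gs:/xxxx/foo.csv'))
df.show()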

Save pandas data frame as csv on to gcloud storage bucket

Submitted by 試著忘記壹切 on 2019-11-27 07:23:07
Question:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys

APP_NAME = "DataFrameToCSV"

spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]
dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05",
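The question body is cut off above. Since the title asks how to save a pandas data frame as a CSV in a Cloud Storage bucket, one common approach is sketched below as an assumption, not necessarily the answer given in the original thread: convert the pandas frame to a Spark DataFrame and let Spark write directly to the gs:// path. The bucket path and sample data are placeholders:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameToCSV").getOrCreate()

# Placeholder pandas frame standing in for the truncated example data.
pdf = pd.DataFrame({"group_id": [1, 1, 2, 2],
                    "date": ["2016-04-01", "2016-04-20", "2016-04-05", "2016-04-05"]})

# Convert to a Spark DataFrame and write straight to the bucket;
# coalesce(1) produces a single part file instead of many shards.
sdf = spark.createDataFrame(pdf)
sdf.coalesce(1).write.mode("overwrite").option("header", "true").csv("gs://my-bucket/output/")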

How to get path to the uploaded file

Submitted by 本秂侑毒 on 2019-11-27 04:55:28
I am running a Spark cluster on Google Cloud and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the SparkContext has been initialized? I am using Scala.
gcloud dataproc jobs submit spark --cluster my-cluster --class MyJob --files config/Configuration.properties --jars my.jar
Answer 1: The local path to a file distributed using the SparkFiles mechanism (the --files argument, or SparkContext.addFile) can be obtained with SparkFiles.get: org.apache.spark.SparkFiles.get
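The question uses Scala; the PySpark API mirrors org.apache.spark.SparkFiles, so an illustrative PySpark version is sketched below for consistency with the other examples on this page. Note that SparkFiles.get resolves the path only once the SparkContext exists, so the lookup happens after initialization; the script and file names are placeholders:

from pyspark import SparkContext, SparkFiles

# Submitted, for example, with:
#   gcloud dataproc jobs submit pyspark my_job.py --cluster my-cluster \
#       --files config/Configuration.properties
sc = SparkContext()

# Resolve the local path of the distributed file, then read it.
props_path = SparkFiles.get("Configuration.properties")
with open(props_path) as f:
    config_text = f.read()
print(config_text)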

spark.sql.crossJoin.enabled for Spark 2.x

Submitted by 六眼飞鱼酱① on 2019-11-26 21:05:44
I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a Cartesian product. Since version 2.0.0 there is a Spark configuration parameter (spark.sql.crossJoin.enabled) that prohibits Cartesian products, and an exception is thrown when one is attempted. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?
Answer 1: For changing default values of configuration settings in Dataproc, you don't even need an init action; you can use the --properties flag when creating your cluster.
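A hedged sketch of both routes follows: the cluster-wide default via --properties mentioned in the answer (the spark: prefix routes the key into Spark's configuration at cluster-creation time), and a per-application setting on the SparkSession builder. The cluster and app names are placeholders:

# Option 1 (cluster-wide default, no init action needed -- see the answer above):
#   gcloud dataproc clusters create my-cluster \
#       --properties 'spark:spark.sql.crossJoin.enabled=true'
#
# Option 2 (per-application): set the property when building the session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cross-join-example")
         .config("spark.sql.crossJoin.enabled", "true")
         .getOrCreate())

left = spark.createDataFrame([(1,), (2,)], ["a"])
right = spark.createDataFrame([("x",), ("y",)], ["b"])

# With the flag enabled, this condition-less join (a Cartesian product)
# no longer raises an AnalysisException.
left.join(right).show()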
