According to the Dataproc docos, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is that we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
To begin, as noted in this question, the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery to perform a word count.
You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD. The Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
"[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"
val conf = sc.hadoopConfiguration
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
fullyQualifiedOutputTableId, outputTableSchema)
// Column of interest in the Shakespeare table.
val fieldName = "word"
// Load the BigQuery table as an RDD of (row number, JsonObject) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
    classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
// Inspect a few rows.
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
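Since the question specifically mentions PySpark: the same connector can be driven from Python through SparkContext.newAPIHadoopRDD. Below is a minimal sketch of the read side, mirroring the word-count example above; the mapred.bq.* configuration keys and the JsonTextBigQueryInputFormat class come from the BigQuery connector, but verify them against the connector version installed on your cluster, and the placeholder project/bucket names are assumptions to fill in.
import json
from pyspark import SparkContext

sc = SparkContext()

# Hadoop configuration for the BigQuery connector.
# <your-project-id> and <your-staging-bucket> are placeholders.
conf = {
    'mapred.bq.project.id': '<your-project-id>',
    'mapred.bq.gcs.bucket': '<your-staging-bucket>',
    'mapred.bq.temp.gcs.path': 'gs://<your-staging-bucket>/tmp/bigquery',
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Each record arrives as (row number, JSON string of the row).
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Same word count as the Scala example above.
word_counts = (
    table_data
    .map(lambda record: json.loads(record[1]))
    .map(lambda row: (row['word'].lower(), int(row['word_count'])))
    .reduceByKey(lambda a, b: a + b))

print(word_counts.take(10))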
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
You can also use the spark-bigquery connector (https://github.com/samelamin/spark-bigquery) to run your queries directly on Dataproc using Spark.
The above example doesn't show how to write data to an output table. To do that, call saveAsNewAPIHadoopFile on an RDD of (String, JsonObject) pairs (outputRdd below), using the same Hadoop configuration that was set up for BigQuery output above:
import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat

// hadoopConf is the Hadoop configuration configured for BigQuery above (conf in the first example).
outputRdd.saveAsNewAPIHadoopFile(
    hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
    classOf[String],
    classOf[JsonObject],
    classOf[BigQueryOutputFormat[String, JsonObject]],
    hadoopConf)
Note that the String key is actually ignored; only the JsonObject values are written to the output table.
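If you are working from PySpark, another route (not shown in the answers above, just a hedged sketch) is to stage the results as newline-delimited JSON in Cloud Storage and load them into BigQuery with the bq CLI. The bucket, dataset, and table names below are placeholders, and word_counts is assumed to be a pair RDD like the one built in the read example.
import json
import subprocess

output_directory = 'gs://<your-staging-bucket>/tmp/bigquery/pyspark_output'

# Stage the results in Cloud Storage as newline-delimited JSON.
(word_counts
 .map(lambda pair: json.dumps({'Word': pair[0], 'Count': pair[1]}))
 .saveAsTextFile(output_directory))

# Load the staged files into BigQuery (schema autodetected here; you can
# also pass an explicit schema instead of --autodetect).
subprocess.check_call([
    'bq', 'load', '--source_format=NEWLINE_DELIMITED_JSON', '--autodetect',
    '<your-dataset>.<your-output-table>',
    output_directory + '/part-*',
])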
Source: https://stackoverflow.com/questions/32960707/dataproc-bigquery-examples-any-available