Question
I have created a Dataproc cluster with an updated init action to install Datalab.
Everything works fine, except that when I query a Hive table from the Datalab notebook:
hc.sql("""select * from invoices limit 10""")
I run into the following exception:
java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
Create cluster
gcloud beta dataproc clusters create ds-cluster \
--project my-exercise-project \
--region us-west1 \
--zone us-west1-b \
--bucket dataproc-datalab \
--scopes cloud-platform \
--num-workers 2 \
--enable-component-gateway \
--initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'CONDA_PACKAGES="python==3.5"' \
--metadata gcs-connector-version=1.9.11
datalab-updated.sh (excerpt)
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
In the Datalab notebook
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show tables in default""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""")
UPDATE
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "~/Downloads/my-exercise-project-f47054fc6fd8.json")
UPDATE 2 (datalab-updated.sh)
function run_datalab(){
if docker run -d --restart always --net=host \
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
echo 'Cloud Datalab Jupyter server successfully deployed.'
else
err 'Failed to run Cloud Datalab'
fi
}
Answer 1:
You should use the Datalab initialization action to install Datalab on the Dataproc cluster:
gcloud dataproc clusters create ${CLUSTER} \
--image-version=1.3 \
--scopes cloud-platform \
--initialization-actions=gs://dataproc-initialization-actions/datalab/datalab.sh
After this, Hive works with GCS out of the box in Datalab:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""SHOW TABLES IN default""").show()
Output:
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
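If you want to double-check that the GCS connector is actually visible to the Spark driver before running Hive queries, a quick sanity check from a notebook cell could look like this (a sketch, assuming the cluster's default Dataproc configuration; if the class lookup throws, the connector jar is not on the driver's classpath):
# Which FileSystem implementation Hadoop will use for gs:// URIs
print(sc._jsc.hadoopConfiguration().get('fs.gs.impl'))
# Ask the driver JVM to resolve the connector class
print(sc._jvm.java.lang.Class.forName('com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem'))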
Creating external table on GCS using Hive in Datalab:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://<BUCKET>/datasets/invoices'""")
Output:
DataFrame[]
Querying GCS table using Hive in Datalab:
hc.sql("""SELECT * FROM invoices LIMIT 10""")
Output:
DataFrame[SubmissionDate: date, TransactionAmount: double, TransactionType: string]
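Note that hc.sql(...) by itself only returns a lazy DataFrame, which is why the output above shows just the schema. To actually see the rows in the notebook, trigger an action, for example (standard PySpark, shown here as a sketch):
df = hc.sql("""SELECT * FROM invoices LIMIT 10""")
df.show()       # prints the rows in the notebook output
df.toPandas()   # or render them as a Pandas DataFrame in Datalab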
Answer 2:
If you want to use Hive in Datalab, you have to enable the Hive metastore when creating the cluster:
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"
In your case, the external table would then be created under the warehouse bucket:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://$PROJECT-warehouse/datasets/invoices'""")
And make sure to add the following settings to enable GCS access:
sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# Required if you are using a service account: enable it and point to the keyfile
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required instead if you are using OAuth
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
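Putting it together, a notebook cell that configures GCS access with a service account keyfile and then queries the table could look roughly like this (a sketch; the keyfile path and table name are placeholders):
from pyspark.sql import HiveContext

# GCS connector settings for the service-account flow (adjust the keyfile path)
sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', '/path/to/keyfile')

hc = HiveContext(sc)
hc.sql("""SELECT * FROM invoices LIMIT 10""").show()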
Source: https://stackoverflow.com/questions/55944773/issue-querying-a-hive-table-in-datalab