Question
The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3, but suspect I'm missing something very simple for Google Cloud Storage.
Does anyone have an example Storage Plugin configuration for Google Cloud Storage?
Thanks
M
Answer 1:
I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:
Install Drill and make the GCS connector accessible (this can be used as an init script for Dataproc; note that it wasn't really tested and relies on a local ZooKeeper instance):

#!/bin/sh
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"

# Download and unpack Drill directly into ${BASEDIR}
mkdir -p "${BASEDIR}"
cd "${BASEDIR}"
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz

# Make the GCS connector jar that ships with Dataproc visible to Drill
ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar "${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar"

# Reuse the cluster's Hadoop configuration, which already contains the GCS settings
mv "${BASEDIR}/conf/core-site.xml" "${BASEDIR}/conf/core-site.xml.old"
ln -s /etc/hadoop/conf/core-site.xml "${BASEDIR}/conf/core-site.xml"

# drillbit.sh lives in bin/, which is not on the PATH by default
"${BASEDIR}/bin/drillbit.sh" start
set +x +e
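Since the script above was not really tested, a quick check after it runs can confirm that both links resolve. The following is only a sketch: the function name `check_gcs_setup` is hypothetical, and it assumes the directory layout used by the install script (a `jars/` and a `conf/` directory under the Drill install directory).

```shell
#!/bin/sh
# Hypothetical check: succeed only if a GCS connector jar and core-site.xml
# are both reachable from the Drill install directory.
# $1 = Drill install directory (for example /opt/apache-drill-1.6.0).
check_gcs_setup() {
  basedir=$1
  status=0
  found=0
  # Accept any connector version; -e follows symlinks, so a dangling link
  # (or an unmatched glob pattern) does not count as found.
  for jar in "$basedir"/jars/gcs-connector-*.jar; do
    if [ -e "$jar" ]; then
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "missing: gcs-connector jar in $basedir/jars" >&2
    status=1
  fi
  if [ ! -e "$basedir/conf/core-site.xml" ]; then
    echo "missing: $basedir/conf/core-site.xml" >&2
    status=1
  fi
  return "$status"
}
```

It could be called as `check_gcs_setup /opt/apache-drill-1.6.0` at the end of the init script. It only checks that the files exist, not that the connector version matches the cluster's Hadoop version.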
Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note that I copied most of it from the S3 config and made minor changes):

{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [ "tbl" ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [ "csv" ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [ "tsv" ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [ "json" ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [ "seq" ]
    },
    "csvh": {
      "type": "text",
      "extensions": [ "csvh" ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
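If you would rather script this step than use the web console, the plugin can probably be registered through Drill's REST interface as well. The sketch below rests on several assumptions: that the Drill web server listens on localhost:8047 without authentication, and that the `/storage/<name>.json` endpoint is available in your Drill version. The function names are hypothetical, and the config is trimmed to a single workspace and the Parquet format for brevity.

```shell
#!/bin/sh
# Hypothetical helper: print a REST payload that wraps a GCS storage plugin config.
# $1 = plugin name, $2 = bucket name (both chosen by the caller).
make_plugin_payload() {
  cat <<EOF
{
  "name": "$1",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "gs://$2",
    "workspaces": {
      "root": { "location": "/", "writable": false, "defaultInputFormat": null }
    },
    "formats": {
      "parquet": { "type": "parquet" }
    }
  }
}
EOF
}

# Assumption: Drill's web server is on localhost:8047 with no authentication.
# Posts the payload built above; curl reads the request body from stdin.
register_plugin() {
  make_plugin_payload "$1" "$2" |
    curl -s -X POST -H "Content-Type: application/json" \
         -d @- "http://localhost:8047/storage/$1.json"
}
```

For the setup described here, the call would be `register_plugin gcs myBucketName`. Running `make_plugin_payload gcs myBucketName` on its own prints the JSON so it can be inspected first.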
Query using the following syntax (note the backticks; the first name is the storage plugin name chosen above, followed by the workspace):

select * from gcs.`root`.`path/to/data/*` limit 10;
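The same query can likely be submitted from a script through Drill's REST interface instead of the console. This is a sketch under assumptions: the web server is on localhost:8047 without authentication, the `/query.json` endpoint is available in your Drill version, and the function names are hypothetical. Passing the SQL in single quotes keeps the shell from treating the backticks as command substitution.

```shell
#!/bin/sh
# Hypothetical helper: wrap a SQL string in the JSON body expected by /query.json.
# $1 = SQL text, passed in single quotes so backticks and * reach Drill unchanged.
make_query_payload() {
  # Escape backslashes and double quotes so the SQL stays a valid JSON string.
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"queryType": "SQL", "query": "%s"}\n' "$escaped"
}

# Assumption: Drill's web server is on localhost:8047 with no authentication.
run_query() {
  make_query_payload "$1" |
    curl -s -X POST -H "Content-Type: application/json" \
         -d @- "http://localhost:8047/query.json"
}
```

With a plugin named gcs, a call would look like: run_query 'select * from gcs.`root`.`path/to/data/*` limit 10'. The SQL is sent without the trailing semicolon, and the escaping covers only backslashes and double quotes, so multi-line SQL would need extra handling.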
Source: https://stackoverflow.com/questions/32883965/apache-drill-using-google-cloud-storage