The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3.
I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:
Install Drill and make the GCS connector accessible. This can be used as an init script for Dataproc, but note that it wasn't thoroughly tested and relies on a local ZooKeeper instance:
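#!/bin/sh
# Trace commands and stop at the first failure.
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"
mkdir -p "${BASEDIR}"
cd "${BASEDIR}"
# Download and unpack Drill into BASEDIR (substitute another mirror if this one is unavailable).
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
# Expose Dataproc's GCS connector jar to Drill.
ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar "${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar"
# Reuse the cluster's Hadoop configuration, keeping Drill's original as a backup.
mv "${BASEDIR}/conf/core-site.xml" "${BASEDIR}/conf/core-site.xml.old"
ln -s /etc/hadoop/conf/core-site.xml "${BASEDIR}/conf/core-site.xml"
# drillbit.sh lives in bin/ and is not on PATH in an init script.
"${BASEDIR}/bin/drillbit.sh" start
set +x +e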
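The script hardcodes the connector version (1.4.5), which may differ between Dataproc images. A minimal sketch of looking the jar up instead is shown here; the function name `find_gcs_connector` is hypothetical, and the default directory is simply the one used in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Drill or Dataproc): print the first
# GCS connector jar found in a directory.
# $1: directory to search; defaults to the path used in the script above.
find_gcs_connector() {
    dir="${1:-/usr/lib/hadoop/lib}"
    for jar in "$dir"/gcs-connector-*.jar; do
        # When nothing matches, the glob stays literal, so check existence.
        if [ -e "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no gcs-connector jar found in $dir" >&2
    return 1
}
```

With this in place, the symlink step could become `ln -s "$(find_gcs_connector)" "${BASEDIR}/jars/"`, which fails loudly under `set -e` if the jar is missing.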
Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (I copied most of it from the s3 config and made minor changes):
{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
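If you would rather script this step than use the console, Drill also exposes a REST endpoint for storage plugins. The following is a rough sketch, assuming the REST API is enabled on the default web port 8047 in your Drill version and that the configuration above has been saved to a file. The function names and the `localhost:8047` default are my assumptions, not something I tested on Dataproc.

```shell
#!/bin/sh
# Wrap a plugin config in the {"name": ..., "config": ...} envelope that
# the storage endpoint expects.
# $1: plugin name, $2: file holding the JSON configuration shown above.
plugin_payload() {
    # %s prints the config verbatim, so escapes such as "\t" are preserved.
    printf '{"name": "%s", "config": %s}\n' "$1" "$(cat "$2")"
}

# POST the payload to a Drillbit's web port.
# $3: host:port of the Drillbit (assumed default: localhost:8047).
register_plugin() {
    name="$1"
    config_file="$2"
    host="${3:-localhost:8047}"
    plugin_payload "$name" "$config_file" |
        curl -sS -X POST -H 'Content-Type: application/json' \
             --data-binary @- "http://${host}/storage/${name}.json"
}
```

For example, `register_plugin gcs gcs-plugin.json` would send the configuration stored in a (hypothetical) file `gcs-plugin.json` to the local Drillbit.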
Query using the following syntax (note the backticks; the first identifier is the name you gave the storage plugin):
select * from gcs.`root`.`path/to/data/*` limit 10;
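The backticks matter because the workspace and path are treated as identifiers, and a path containing `/` or `*` is not a valid bare identifier. As a small illustration, here is a sketch of a helper that assembles such a query string. The name `drill_select` and its argument order are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: build a Drill query over a path in a file-based plugin.
# $1: storage plugin name, $2: workspace, $3: path or glob,
# $4: row limit (defaults to 10).
drill_select() {
    # Backticks are literal here because the format string is single-quoted.
    printf 'select * from %s.`%s`.`%s` limit %s;\n' "$1" "$2" "$3" "${4:-10}"
}
```

Calling it with your plugin name, the `root` workspace, and `'path/to/data/*'` prints a query of the form shown above, ready to paste into the Drill console.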