Apache Drill using Google Cloud Storage

前端未结

关注

 2  744

温柔的废话 2021-01-21 14:15

The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can\'t find any information on how to do that. I\'ve got it working fine with S3

2条回答

故里飘歌 (楼主)

2021-01-21 15:01

I managed to query parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. In order to set that up, I took the following steps:

Install Drill and make the GCS connector accessible (this can be used as an init-script for dataproc, just note it wasn't really tested and relies on a local zookeeper instance):

#!/bin/sh
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"
mkdir -p ${BASEDIR}
cd ${BASEDIR}
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz

ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar
mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml

drillbit.sh start

set +x +e

Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note I copied most of it from the s3 config, made minor changes):

{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}

Query using the following syntax (note the backticks):
```
select * from gs.`root`.`path/to/data/*` limit 10;
```

0 讨论(0)

查看其它2个回答