Apache Drill using Google Cloud Storage

Posted by 谁都会走 on 2019-12-02 01:55:28

Question


The Apache Drill feature list mentions that it can query data in Google Cloud Storage, but I can't find any information on how to do that. I have it working fine with S3, but I suspect I'm missing something very simple for Google Cloud Storage.

Does anyone have an example Storage Plugin configuration for Google Cloud Storage?

Thanks

M


Answer 1:


I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:

  1. Install Drill and make the GCS connector accessible. The script below can be used as an init script for Dataproc, but note that it wasn't really tested and relies on a local ZooKeeper instance:

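    #!/bin/sh
    # Print each command and abort on the first failure.
    set -x -e
    BASEDIR="/opt/apache-drill-1.6.0"
    mkdir -p "${BASEDIR}"
    cd "${BASEDIR}"
    # Download and unpack Drill, then flatten the archive into BASEDIR.
    wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
    tar -xzvf apache-drill-1.6.0.tar.gz
    mv apache-drill-1.6.0/* .
    rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
    
    # Put the cluster's GCS connector jar on Drill's classpath.
    ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar "${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar"
    # Reuse the cluster's Hadoop configuration, which is expected to define the gs:// filesystem.
    mv "${BASEDIR}/conf/core-site.xml" "${BASEDIR}/conf/core-site.xml.old"
    ln -s /etc/hadoop/conf/core-site.xml "${BASEDIR}/conf/core-site.xml"
    
    # drillbit.sh lives in bin/, which is normally not on PATH.
    "${BASEDIR}/bin/drillbit.sh" start
    
    set +x +e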
    
  2. Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration. I copied most of it from the S3 configuration and made minor changes:

    {
      "type": "file",
      "enabled": true,
      "connection": "gs://myBucketName",
      "config": null,
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        },
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "psv": {
          "type": "text",
          "extensions": [
            "tbl"
          ],
          "delimiter": "|"
        },
        "csv": {
          "type": "text",
          "extensions": [
            "csv"
          ],
          "delimiter": ","
        },
        "tsv": {
          "type": "text",
          "extensions": [
            "tsv"
          ],
          "delimiter": "\t"
        },
        "parquet": {
          "type": "parquet"
        },
        "json": {
          "type": "json",
          "extensions": [
            "json"
          ]
        },
        "avro": {
          "type": "avro"
        },
        "sequencefile": {
          "type": "sequencefile",
          "extensions": [
            "seq"
          ]
        },
        "csvh": {
          "type": "text",
          "extensions": [
            "csvh"
          ],
          "extractHeader": true,
          "delimiter": ","
        }
      }
    }
    
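  3. Query using the following syntax, where the first identifier is the storage plugin name from step 2 and the second is the workspace (note the backticks):

    select * from gcs.`root`.`path/to/data/*` limit 10;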
    
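Before starting the Drillbit, it can help to confirm that the two links from step 1 actually resolve. The following is a minimal sketch under that assumption; `check_drill_gcs` is a hypothetical helper name, not something Drill ships, and the paths simply mirror the install script above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Drill: checks the two links made in step 1.
# $1 = Drill install directory (BASEDIR in the script above).
check_drill_gcs() {
    basedir="$1"
    status=0

    # At least one gcs-connector jar must resolve to a real file.
    jar_found=0
    for jar in "${basedir}"/jars/gcs-connector-*.jar; do
        if [ -e "${jar}" ]; then
            jar_found=1
        fi
    done
    if [ "${jar_found}" -eq 0 ]; then
        echo "missing: GCS connector jar in ${basedir}/jars" >&2
        status=1
    fi

    # -e follows symlinks, so a dangling core-site.xml link also fails here.
    if [ ! -e "${basedir}/conf/core-site.xml" ]; then
        echo "missing: ${basedir}/conf/core-site.xml" >&2
        status=1
    fi

    return "${status}"
}
```

For example, `check_drill_gcs /opt/apache-drill-1.6.0 && echo ok` prints `ok` only when both files are reachable.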

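As an alternative to pasting the configuration into the web console in step 2, the plugin can probably be registered through Drill's REST API. This is a sketch, not something I verified against 1.6.0: it assumes the `/storage/{name}.json` endpoint described in the Drill REST documentation and the default web port 8047. The function names and the `DRILL_URL` variable are made up for illustration, and the payload is trimmed to the root workspace and the Parquet format.

```shell
#!/bin/sh
# Build the request body for Drill's storage REST endpoint.
# $1 = plugin name, $2 = bucket name (both placeholders).
gcs_plugin_payload() {
    cat <<EOF
{
  "name": "$1",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "gs://$2",
    "workspaces": {
      "root": { "location": "/", "writable": false, "defaultInputFormat": null }
    },
    "formats": {
      "parquet": { "type": "parquet" }
    }
  }
}
EOF
}

# POST the payload to a running Drillbit.
# DRILL_URL is an assumed variable; it defaults to the usual web console address.
register_gcs_plugin() {
    drill_url="${DRILL_URL:-http://localhost:8047}"
    gcs_plugin_payload "$1" "$2" |
        curl -sS -X POST -H "Content-Type: application/json" \
             --data-binary @- "${drill_url}/storage/$1.json"
}
```

Calling `register_gcs_plugin gcs myBucketName` would then create a plugin named gcs pointing at `gs://myBucketName`. Add the other formats from step 2 to the payload if you need them.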

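The query from step 3 can also be submitted from a script instead of the console. The sketch below assumes Drill's `/query.json` REST endpoint on the default port 8047 and a plugin named gcs, as suggested in step 2. The function names and `DRILL_URL` are hypothetical. Pass the SQL in single quotes so the shell does not treat the backticks as command substitution, and keep it on one line, since the escaping here only handles backslashes and double quotes.

```shell
#!/bin/sh
# Wrap a single-line SQL statement in the JSON body expected by /query.json.
drill_query_payload() {
    # Escape backslashes and double quotes so the SQL is a valid JSON string.
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"queryType": "SQL", "query": "%s"}\n' "${escaped}"
}

# Submit the statement to a running Drillbit and print the JSON response.
# DRILL_URL is an assumed variable; it defaults to the usual web console address.
run_drill_query() {
    drill_url="${DRILL_URL:-http://localhost:8047}"
    drill_query_payload "$1" |
        curl -sS -X POST -H "Content-Type: application/json" \
             --data-binary @- "${drill_url}/query.json"
}
```

A call would look like: run_drill_query 'select * from gcs.`root`.`path/to/data/*` limit 10'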
Source: https://stackoverflow.com/questions/32883965/apache-drill-using-google-cloud-storage
