Google Composer- How do I install Microsoft SQL Server ODBC drivers on environments

后端 未结 4 1879
灰色年华
灰色年华 2021-01-16 09:37

I am new to GCP and Airflow and am trying to run my python pipelines via a simple PYODBC connection via python 3. However, I believe I have found what I need to install on t

相关标签:
4条回答
  • 2021-01-16 10:23

    I was facing the same problem. The first solution which worked for me was building a docker image that would install the drivers and then run the code. Initially I tried to find a way of installing the drivers on the cluster but after many failures I read in documentation that the airflow image in composer is curated by Google and no changes affecting the image are allowable. So here is my docker file:

    FROM python:3.7-slim-buster
    #FROM gcr.io/data-development-254912/gcp_bi_baseimage 
    #FROM gcp_bi_baseimage
    LABEL maintainer = " " 
    ENV APP_HOME /app 
    WORKDIR $APP_HOME
    COPY / ./
    # install nano 
    RUN apt-get update \
        && apt-get install --yes --no-install-recommends \
            apt-utils \
            apt-transport-https \
            curl \
            gnupg \
            unixodbc-dev \ 
            gcc \
            g++ \ 
            nano \
        && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
        && curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list \
        && apt-get update \
        && ACCEPT_EULA=Y apt-get install --yes --no-install-recommends msodbcsql17 \
        && apt-get install libgssapi-krb5-2 \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/* \
        && rm -rf /tmp/*
     RUN pip install -r requirements.txt
     CMD ["python","app.py"]
    

    requirements.txt:

    pyodbc==4.0.28
    google-cloud-bigquery==1.24.0    
    google-cloud-storage==1.26.0
    

    You should be good from this point on.

    Since then I managed to set up an Airflow named connection to our sql server and am using mssql_operator or mssql_hook. I had worked with a cloud engineer to set up the networking just right. What I found is that the named connection is much easier to use, yet kubernetesPodOperator is still much more reliable.

    0 讨论(0)
  • 2021-01-16 10:27

    Cloud Composer currently primarily supports installing PyPI packages written in pure Python. Installing system packages is not fully supported at this time, but there are some workarounds (such as setting LD_LIBRARY_PATH and uploading shared libraries, etc). You're getting aborts because you installed the Python part of the package, but not system dependencies the Python package depends on.

    As you read, changes to Airflow workers in Composer are ephemeral (or at least, should be treated as such), but one way of working around this is to install packages using BashOperator before the task that needs the library runs. It's not pretty, but it ensure that dependencies are installed on the worker before the Python code that needs them is called.

    0 讨论(0)
  • 2021-01-16 10:28

    It's Simple ! No Need of DockerFile, KubernetesPodOperator, LD_LIBRARY_PATH, etc just a basic python operator will do

    Points to consider

    • GCP Composer Worker's Pod image is ubuntu 1604 (just run a basic python operator with a command os.system('cat /etc/os-release') to check )
    • It has unixodbc-dev already installed on the worker's pod image
    • Composer creates the bucket and mount it with airflow
    • So why not just install the pyodbc from pypi packages and provide the mssql odbc driver as paramter in pyodbc connection method

    here 'gs://bucket_created_by_composer' == '/home/airflow/gcs'

    gcs bucket created by composer ->
              -> data/
              -> dags/
    

    Step By Step Approach

    Step 1: Install pyodbc, mssql odbc on any ubuntu instances to get the driver files

    for consideration lets do it on GCP VM Intance with ubuntu 1804 image

    #update the packages
    sudo apt update
    sudo apt-get update -y
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
    curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
    sudo apt-get update -y
    echo Installing mssql-tools and unixODBC developer...
    sudo ACCEPT_EULA=Y apt-get install -y mssql-tools unixodbc-dev
    sudo apt-get update -y
    sudo apt-get install  -y mssql-tools #it includes sql_cmd and bcp (we dont need those)
    sudo apt install python3-pip #installing pip3
    pip3 install pyodbc 
    

    Step 2: Get the Driver Files and upload it to the data folder of gcs_bucket created by the composer

    cd /opt/microsoft
    #now you can see there is one directory 'msodbcsql17', version may change
    #we need to upload this directory to the data folder of gcs_bucket
    
    #for this you may choose which ever approach suits you
    #copying the directory to /<home/user> for proper zipping/uploading to gcs
    cp -r msodbcsql17 /home/<user> #you may need to use sudo 
    #upload this /home/<user>/msodbcsql17 to any gcs_bucket 
    gsutil cp -r /home/<user>/msodbcsql17 gs://<your-gcs-bucket>
    

    download this folder from gcs bucket to local and the upload this folder to data folder of gcs bucket created by composer

    choose any approach/method, main aim is to get the msodbcsql17 folder in the data folder of gcs bucket created by composer

    Final structure:

    gcs bucket created by composer ->
              -> data/msodbcsql17/
              -> dags/<your_dags.py>
    

    Step 3: using this msodbcsql17 drivers for pyodbc connection

    EXAMPLE DAG:

    import os
    import time
    import datetime
    import argparse
    import json
    from airflow import DAG
    import airflow
    
    from airflow.operators import python_operator
    
    
    default_dag_args = {
        'start_date': airflow.utils.dates.days_ago(0), #
        'provide_context': True
    }
    
    
    
    dag = DAG(
            'pyodbc_test',
            schedule_interval=None, #change for composer
            default_args=default_dag_args
            )
    
    
    def check_connection(**kwargs):
        print('hello')
        driver='/home/airflow/gcs/data/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1'
        #this is the main driver file, the exact location can be found on gcs_bucket/data folder or check the /etc/odbcinst.in file of ubuntu instance in which you installed the pyodbc earlier
    
        def tconnection(ServerIp,LoginName,Password,mssql_portno):
            """ A method which return connection object"""
            import pyodbc
            pyodbc.pooling = False 
            try:   
                sql_conn = pyodbc.connect("DRIVER={4};SERVER={0},{1};UID={2};PWD={3}".format(ServerIp,mssql_portno,LoginName,Password,driver)) 
            except pyodbc.Error as ex:
                sqlstate = ex.args[1]
                raise
    
    
            return sql_conn
    
        con=tconnection('<your-server-ip>','<your-login-name>','<your-password>','1433')
        #recommendation is to take the password and login from airflow connections
        import pandas as pd
        q='select * from <your-db-name>.<your-schema-name>.<your-table-name>'
        df=pd.read_sql(q,con)
        print(df)
    
    Tcheck_connection= python_operator.PythonOperator(
            task_id='Tcheck_connection',
            python_callable=check_connection,
            dag=dag ) 
    
    
    #calling the task sequence
    Tcheck_connection 
    

    PYPI Packages

    pyodbc
    pandas
    

    Have tested on Composer recently

    0 讨论(0)
  • 2021-01-16 10:28

    Consider that Composer is a Google managed implementation of Apache Airflow hence, expect it to behave differently.

    Having this in mind, custom Python dependincies and binary dependencies not available in the Cloud Composer worker image can use the KubernetesPodOperator option.

    What this does essentially, is to allow you to create a custom container image with all your requirements, push it into a container image repository (Dockerhub, GCR) and then pull it into your Composer environment, so all of your dependencies are met.

    This escalates better as there is no need for you to interact with the machines (this approach is stated in your original question), and it looks easier to just build your container image with whatever you need in there.

    Specifically speaking of pyodbc and in this context of dependency installation using Composer, there is a feature request to address this issue, that also outlines a workaround (basically what is mentioned in this answer). You might want to check it out.

    0 讨论(0)
提交回复
热议问题