Overwrite MySQL tables with AWS Glue

时光取名叫无心 2021-01-12 17:38

I have a Lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an optio

3 answers
  • 2021-01-12 18:20

    I ran into the same issue with Redshift, and the best solution we could come up with was to create a Java class that loads the MySQL driver and issues a truncate table:

    package com.my.glue.utils.mysql;
    
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    
    @SuppressWarnings("unused")
    public class MySQLTruncateClient {
        public void truncate(String tableName, String url) throws SQLException, ClassNotFoundException {
            Class.forName("com.mysql.jdbc.Driver");
            try (Connection mysqlConnection = DriverManager.getConnection(url);
                Statement statement = mysqlConnection.createStatement()) {
                statement.execute(String.format("TRUNCATE TABLE %s", tableName));
            }
        }
    }
    

    Upload that JAR to S3 along with the MySQL driver JAR, and add both to your job as dependent JARs. In your PySpark script, you can then load the truncate client with:

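    from py4j.java_gateway import java_import

    # Make the class visible on the JVM side, then call it through Py4J.
    java_import(glue_context._jvm, "com.my.glue.utils.mysql.MySQLTruncateClient")
    truncate_client = glue_context._jvm.MySQLTruncateClient()
    truncate_client.truncate('my_table', 'jdbc:mysql://...')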
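
    Because the Java method formats the table name straight into the SQL string, it may be worth checking the name before passing it across. The sketch below is one way to do that; `validate_table_name` and its allowed character set are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical helper: table names cannot be bound as SQL parameters,
# so only accept a conservative set of identifier characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_table_name(table_name: str) -> str:
    """Return table_name unchanged if it looks like a plain identifier."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return table_name
```

    It could then wrap the first argument, for example `truncate_client.truncate(validate_table_name('my_table'), 'jdbc:mysql://...')`.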
    
  • 2021-01-12 18:39

    The workaround I've come up with, which is a little simpler than the alternative posted, is the following:

    • Create a staging table in MySQL, and load your new data into this table.
    • Run the command: REPLACE INTO myTable SELECT * FROM myStagingTable;
    • Truncate the staging table.

    This can be done with:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
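    import pymysql
    pymysql.install_as_MySQLdb()
    import MySQLdb

    # Placeholders: substitute your own host and credentials.
    db = MySQLdb.connect(host="URL", user="USERNAME", password="PASSWORD", database="DATABASE")
    cursor = db.cursor()
    # Upsert the staged rows into the target table.
    cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
    db.commit()  # PyMySQL does not autocommit by default
    # Empty the staging table so the next run starts clean.
    cursor.execute("TRUNCATE TABLE myStagingTable")

    db.close()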
    job.commit()
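
    To see what the staging-table steps do to the data, here is a minimal sketch that runs without a MySQL server. It uses Python's built-in sqlite3 module as a stand-in (an assumption made only so the example is runnable; SQLite also accepts REPLACE INTO, but has no TRUNCATE TABLE, so DELETE FROM is used instead). The column names and sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for the sketch).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, value TEXT)")
cur.execute("CREATE TABLE myStagingTable (id INTEGER PRIMARY KEY, value TEXT)")

# Rows already present in the target table.
cur.executemany("INSERT INTO myTable VALUES (?, ?)", [(1, "old"), (2, "keep")])
# Freshly polled rows: id 1 collides with an existing key, id 3 is new.
cur.executemany("INSERT INTO myStagingTable VALUES (?, ?)", [(1, "new"), (3, "added")])

# Rows whose key already exists are replaced; the others are inserted.
cur.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
# SQLite has no TRUNCATE TABLE, so clear the staging table with DELETE.
cur.execute("DELETE FROM myStagingTable")
conn.commit()

rows = cur.execute("SELECT id, value FROM myTable ORDER BY id").fetchall()
staging_count = cur.execute("SELECT COUNT(*) FROM myStagingTable").fetchone()[0]
print(rows)           # [(1, 'new'), (2, 'keep'), (3, 'added')]
print(staging_count)  # 0
```

    Note that REPLACE removes the conflicting row and inserts the new one, and that SELECT * relies on both tables having the same columns in the same order.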
    
  • 2021-01-12 18:41

    I found a simpler way that works with JDBC connections in Glue. When you're writing data to your Redshift cluster, the Glue team recommends truncating the table with a preaction, as in the following sample code:

    datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame = resolvechoice4,
        catalog_connection = "<connection-name>",
        connection_options = {
            "dbtable": "<target-table>",
            "database": "testdb",
            "preactions": "TRUNCATE TABLE <table-name>"
        },
        redshift_tmp_dir = args["TempDir"],
        transformation_ctx = "datasink5"
    )
    

    where

    connection-name the name of your Glue connection to your Redshift cluster
    target-table    the table you're loading the data into
    testdb          name of the database
    table-name      name of the table to truncate (ideally the table you're loading into)
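
    If the same table is both truncated and loaded, the options dictionary can be built from a single name so the two values cannot drift apart. This is only a sketch: `build_truncate_options` is a hypothetical helper, and it makes no claim about whether "preactions" is honoured by targets other than the Redshift one in the sample above.

```python
def build_truncate_options(target_table: str, database: str) -> dict:
    """Build connection_options that truncate target_table before loading it."""
    return {
        "dbtable": target_table,
        "database": database,
        # Statement to run before the load, as in the sample above.
        "preactions": f"TRUNCATE TABLE {target_table}",
    }


options = build_truncate_options("my_table", "testdb")
print(options["preactions"])  # TRUNCATE TABLE my_table
```

    The result would be passed as `connection_options = options` in the `from_jdbc_conf` call.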
    