I have a Lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an option to overwrite rows with matching keys, rather than appending duplicates?
I ran into the same issue with Redshift, and the best solution we could come up with was to create a small Java class that loads the MySQL driver and issues a TRUNCATE TABLE statement:
package com.my.glue.utils.mysql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

@SuppressWarnings("unused")
public class MySQLTruncateClient {
    /**
     * Truncates the given table over a plain JDBC connection. The URL
     * must carry the credentials, e.g. jdbc:mysql://host/db?user=...&password=...
     */
    public void truncate(String tableName, String url) throws SQLException, ClassNotFoundException {
        // Register the MySQL JDBC driver so DriverManager can resolve the URL.
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection mysqlConnection = DriverManager.getConnection(url);
             Statement statement = mysqlConnection.createStatement()) {
            statement.execute(String.format("TRUNCATE TABLE %s", tableName));
        }
    }
}
Upload that JAR to S3 along with the MySQL driver JAR and make your job dependent on both. In your PySpark script, you can then load the truncate method with:
from py4j.java_gateway import java_import

java_import(glue_context._jvm, "com.my.glue.utils.mysql.MySQLTruncateClient")
truncate_client = glue_context._jvm.MySQLTruncateClient()
truncate_client.truncate('my_table', 'jdbc:mysql://...')
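To make the job dependent on those JARs, their S3 paths go into the job's --extra-jars argument. A minimal sketch of wiring that up with boto3 (the bucket, key, job, and role names below are placeholders, not from the original answer):

import boto3

glue = boto3.client('glue')

# Hypothetical S3 paths -- substitute your own bucket and keys.
extra_jars = ','.join([
    's3://my-bucket/jars/mysql-truncate-client.jar',   # the class above
    's3://my-bucket/jars/mysql-connector-java.jar',    # MySQL JDBC driver
])

glue.update_job(
    JobName='my-glue-job',  # hypothetical job name
    JobUpdate={
        'Role': 'my-glue-role',
        'Command': {'Name': 'glueetl',
                    'ScriptLocation': 's3://my-bucket/scripts/job.py'},
        'DefaultArguments': {'--extra-jars': extra_jars},
    },
)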
The workaround I've come up with, which is a little simpler than the alternative posted, is the following:
REPLACE INTO myTable SELECT * FROM myStagingTable;
This can be done with:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb

# Positional arguments are host, user, password, database.
db = MySQLdb.connect("URL", "USERNAME", "PASSWORD", "DATABASE")
cursor = db.cursor()
cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
db.commit()  # REPLACE is DML, so the change must be committed
db.close()
job.commit()
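For the REPLACE to have anything to merge, myStagingTable has to be loaded first. A minimal sketch of that step, assuming a Glue catalog connection named "mysql-conn" and a DynamicFrame dyf holding the freshly polled data (both names are mine, not from the answer):

# Populate the staging table before issuing the REPLACE above;
# the REPLACE then merges it into myTable by unique key.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dyf,                        # DynamicFrame with the new rows
    catalog_connection = "mysql-conn",  # hypothetical connection name
    connection_options = {"dbtable": "myStagingTable", "database": "DATABASE"},
)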
I found a simpler way of working with JDBC connections in Glue. The way the Glue team recommends truncating a table is with the following sample code when you're writing data to your Redshift cluster:
datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolvechoice4,
    catalog_connection = "<connection-name>",
    connection_options = {
        "dbtable": "<target-table>",
        "database": "testdb",
        "preactions": "TRUNCATE TABLE <table-name>",
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink5",
)
where:

- connection-name: your Glue connection name for your Redshift cluster
- target-table: the table you're loading the data into
- testdb: the name of the database
- table-name: the name of the table to truncate (ideally the table you're loading into)
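If the target table changes between runs, the table name can be passed in as a job argument and spliced into the preaction. A sketch under that assumption (TARGET_TABLE is a made-up parameter name, supplied as --TARGET_TABLE when the job runs):

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TARGET_TABLE'])

datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolvechoice4,
    catalog_connection = "<connection-name>",
    connection_options = {
        "dbtable": args['TARGET_TABLE'],
        "database": "testdb",
        # Empty the target before each load so the run fully replaces it.
        "preactions": "TRUNCATE TABLE %s" % args['TARGET_TABLE'],
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink5",
)

Keep in mind that preactions is documented for Redshift connections; for a MySQL target you may need one of the approaches above instead.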