How to fetch multiple tables using Spark SQL

Question

I am fetching data from MySQL using PySpark, but only for one table at a time. I want to fetch all tables from the MySQL database without repeating the JDBC connection setup for each one. See the code below.

Is it possible to simplify my code? Thank you in advance.

url = "jdbc:mysql://localhost:3306/dbname"
table_df=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df, "table1")

table_df_1=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name_1").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df_1, "table2")

Answer 1:

You first need to obtain the list of tables in your MySQL database. You can either query it with SQL or manually maintain a file that lists every table.

Then, assuming you have a Python list of table names called tablename_list, you can simply loop over it like this:

url = "jdbc:mysql://localhost:3306/dbname"
reader = sqlContext.read.format("jdbc").option("url",url).option("user","root").option("password", "root")
for tablename in tablename_list:
    reader.option("dbtable",tablename).load().createTempView(tablename)

This creates one temporary view per table, named after the table itself. If you want a different name in Spark, you can probably replace tablename_list with a list of tuples (tablename_in_mysql, tablename_in_spark).
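The tuple variant could be sketched as follows. This is only a minimal illustration: the helper name register_tables is hypothetical, and it uses createOrReplaceTempView instead of createTempView so that rerunning it does not fail when a view with the same name already exists. The reader argument is assumed to be a DataFrameReader already configured with url, user and password, as in the snippet above.

```python
def register_tables(reader, name_pairs):
    """Load each MySQL table through `reader` and register it as a temp view.

    reader: a DataFrameReader already configured with url, user and password
            (assumption: built like `reader` in the snippet above).
    name_pairs: iterable of (tablename_in_mysql, tablename_in_spark) tuples.
    Returns the list of view names that were registered, in order.
    """
    registered = []
    for mysql_name, spark_name in name_pairs:
        # Only the "dbtable" option changes between iterations.
        df = reader.option("dbtable", mysql_name).load()
        # Swapped in for createTempView: replaces an existing view
        # of the same name instead of raising an error.
        df.createOrReplaceTempView(spark_name)
        registered.append(spark_name)
    return registered
```

With the names from the question, the call would be register_tables(reader, [("table_name", "table1"), ("table_name_1", "table2")]), after which both views can be queried through Spark SQL.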

Answer 2:

@Steven already gave a perfect answer. As he said, you first need a Python list of table names, which you can obtain like this:

# List the tables on the server
table_names_list = spark.read.format('jdbc'). \
     options(
         url='jdbc:postgresql://localhost:5432/', # database url (local, remote)
         dbtable='information_schema.tables',
         user='YOUR_USERNAME',
         password='YOUR_PASSWORD',
         driver='org.postgresql.Driver'). \
     load().\
     filter("table_schema = 'public'").select("table_name")
#DataFrame[table_name: string]

# table_names_list.collect()
# [Row(table_name='employee'), Row(table_name='bonus')]

table_names_list = [row.table_name for row in table_names_list.collect()]
print(table_names_list)
# ['employee', 'bonus']

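Note that this example targets PostgreSQL. For MySQL, change the url and driver arguments, and filter table_schema on your database name instead of 'public', because MySQL stores the database name in that column.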
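For the MySQL database from the question, the same lookup might look like the sketch below. The function name list_mysql_tables and its parameters are hypothetical. The default driver class com.mysql.cj.jdbc.Driver assumes MySQL Connector/J 8 is on the Spark classpath; older connector versions use com.mysql.jdbc.Driver.

```python
def list_mysql_tables(spark, url, database, user, password,
                      driver="com.mysql.cj.jdbc.Driver"):
    """Return the table names of `database` as a plain Python list.

    spark: an existing SparkSession (assumption: created elsewhere).
    driver: assumed to be the Connector/J 8 class name; adjust as needed.
    """
    tables_df = (
        spark.read.format("jdbc")
        .options(
            url=url,
            dbtable="information_schema.tables",
            user=user,
            password=password,
            driver=driver,
        )
        .load()
        # In MySQL, table_schema holds the database name.
        # `database` is interpolated as-is, so pass a trusted value only.
        .filter("table_schema = '{}'".format(database))
        .select("table_name")
    )
    # Index by position: MySQL may report the column as TABLE_NAME,
    # and attribute access on a Row is case-sensitive.
    return [row[0] for row in tables_df.collect()]
```

A call such as list_mysql_tables(spark, "jdbc:mysql://localhost:3306/dbname", "dbname", "root", "root") would then produce the list to loop over in the first answer.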

Source: https://stackoverflow.com/questions/54493740/how-to-fetch-multiple-tables-using-spark-sql

Tags

python

pyspark-sql