how to fetch multiple tables using spark sql

Submitted by 女生的网名这么多〃 on 2021-01-05 11:01:17

Question


I am fetching data from MySQL using PySpark, but only for one table at a time. I want to fetch all tables from the MySQL database without calling the JDBC connection again and again. See the code below.

Is it possible to simplify my code? Thank you in advance.

url = "jdbc:mysql://localhost:3306/dbname"
table_df = (sqlContext.read.format("jdbc")
            .option("url", url)
            .option("dbtable", "table_name")
            .option("user", "root")
            .option("password", "root")
            .load())
sqlContext.registerDataFrameAsTable(table_df, "table1")

table_df_1 = (sqlContext.read.format("jdbc")
              .option("url", url)
              .option("dbtable", "table_name_1")
              .option("user", "root")
              .option("password", "root")
              .load())
sqlContext.registerDataFrameAsTable(table_df_1, "table2")

Answer 1:


You need some way to obtain the list of tables in your MySQL database: either run a SQL query against the server to get it, or manually create a file listing them.
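For MySQL specifically, one way to get that list is to read `information_schema.tables` over the same JDBC connection. This is a sketch: `sqlContext` is the session from the question, and every connection parameter below is a placeholder for your own server.

```python
def list_mysql_tables(sqlContext, url, user, password, schema):
    """Return the table names of one MySQL schema via information_schema.
    All connection parameters are placeholders for your own server."""
    df = (sqlContext.read.format("jdbc")
          .option("url", url)                              # e.g. jdbc:mysql://localhost:3306/
          .option("user", user)
          .option("password", password)
          .option("dbtable", "information_schema.tables")  # MySQL's catalog of tables
          .load()
          .filter("table_schema = '{0}'".format(schema))   # keep only your database
          .select("table_name"))
    return [row[0] for row in df.collect()]

# tablename_list = list_mysql_tables(sqlContext, "jdbc:mysql://localhost:3306/",
#                                    "root", "root", "dbname")
```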

Then, assuming you can build a Python list of table names, tablename_list, you can simply loop over it like this:

url = "jdbc:mysql://localhost:3306/dbname"
reader = sqlContext.read.format("jdbc").option("url",url).option("user","root").option("password", "root")
for tablename in tablename_list:
    reader.option("dbtable",tablename).load().createTempView(tablename)

This will create a temporary view with the same name as the MySQL table. If you want a different name, replace tablename_list with a list of tuples (tablename_in_mysql, tablename_in_spark).
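A sketch of that tuple variant, using the hypothetical view names table1 and table2 from the question (`sqlContext` and the credentials are assumed from the question; the Spark parts only run when the function is called against a live server):

```python
# Hypothetical (mysql_name, spark_name) pairs -- rename each view on registration.
table_pairs = [("table_name", "table1"), ("table_name_1", "table2")]

def register_views(sqlContext, url, user, password, pairs):
    """Load each MySQL table once and register it under a chosen Spark view name."""
    reader = (sqlContext.read.format("jdbc")
              .option("url", url)
              .option("user", user)
              .option("password", password))
    for mysql_name, spark_name in pairs:
        # option("dbtable", ...) overwrites the previous table name on the
        # shared reader, so the connection settings are configured only once.
        reader.option("dbtable", mysql_name).load().createTempView(spark_name)

# register_views(sqlContext, "jdbc:mysql://localhost:3306/dbname",
#                "root", "root", table_pairs)
```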




Answer 2:


@Steven already gave a perfect answer. As he said, to build a Python list of table names you can use:

# list of the tables in the server
table_names_list = (spark.read.format("jdbc")
    .options(
        url="jdbc:postgresql://localhost:5432/",  # database url (local or remote)
        dbtable="information_schema.tables",
        user="YOUR_USERNAME",
        password="YOUR_PASSWORD",
        driver="org.postgresql.Driver")
    .load()
    .filter("table_schema = 'public'")
    .select("table_name"))
# DataFrame[table_name: string]

# table_names_list.collect()
# [Row(table_name='employee'), Row(table_name='bonus')]

table_names_list = [row.table_name for row in table_names_list.collect()]
print(table_names_list)
# ['employee', 'bonus']

Note that this example is for PostgreSQL; for MySQL, change the url and driver arguments accordingly.
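Putting the two answers together, one loop registers every listed table and the views can then be joined in Spark SQL. This is a sketch: `spark` is the session from the answer above, the connection details are placeholders, and the employee/bonus query at the end assumes a hypothetical employee_id column.

```python
def register_all(spark, url, user, password, driver, table_names):
    """Register every listed table as a Spark temp view, reusing one reader."""
    reader = (spark.read.format("jdbc")
              .option("url", url)
              .option("user", user)
              .option("password", password)
              .option("driver", driver))
    for name in table_names:
        reader.option("dbtable", name).load().createTempView(name)

# register_all(spark, "jdbc:postgresql://localhost:5432/", "YOUR_USERNAME",
#              "YOUR_PASSWORD", "org.postgresql.Driver", table_names_list)
# spark.sql("SELECT * FROM employee JOIN bonus USING (employee_id)").show()
```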



Source: https://stackoverflow.com/questions/54493740/how-to-fetch-multiple-tables-using-spark-sql
