Unsupported Array error when reading JDBC source in (Py)Spark?

Posted by 烈酒焚心 on 2021-02-10 06:27:50

Question


I am trying to convert a PostgreSQL database into Spark DataFrames. Here is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX" 
connectionProperties = {
  "user" : " ",
  "password" : " ",
  "driver" : "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)

table_name_list = df.select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)

The error I am getting:

java.sql.SQLException: Unsupported type ARRAY on generating df2 for table name

If I hard-code the table name value, I do not get the error:

df2 = spark.read.jdbc(jdbcUrl,"conditions",properties=connectionProperties) 

I checked the type of table_name and it is a string. Is this the correct approach?


Answer 1:


I guess you don't want the tables that belong to the internal workings of Postgres, such as pg_type, pg_policies, etc., whose schema is pg_catalog. Some of those system tables contain array-typed columns that Spark's JDBC reader cannot map, and that is what causes the error

py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : java.sql.SQLException: Unsupported type ARRAY

when you try to read them as

spark.read.jdbc(url=jdbcUrl, table='pg_type', properties=connectionProperties)

and there are tables such as applicable_roles, view_table_usage, etc., whose schema is information_schema, which causes

py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : org.postgresql.util.PSQLException: ERROR: relation "view_table_usage" does not exist

when you try to read them as

spark.read.jdbc(url=jdbcUrl, table='view_table_usage', properties=connectionProperties)
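(A side note, not part of the original answer: the "relation ... does not exist" failure is just a name-lookup problem. The unqualified name is not on Postgres's default search_path, so schema-qualifying the table name makes the lookup succeed, although whether the read then works still depends on the column types in that view.)

# hypothetical sketch: schema-qualify the information_schema view before reading it
df_usage = spark.read.jdbc(url=jdbcUrl,
                           table='information_schema.view_table_usage',
                           properties=connectionProperties)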

Only the tables whose schema is public can be read into DataFrames with the jdbc calls above.

You asked, "I checked table_name type and it is String, is this the correct approach?"

So you need to filter out those table names and apply your logic:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://hostname:port/"
connectionProperties = {
  "user" : " ",
  "password" : " ",
  "driver" : "org.postgresql.Driver"
}

query = "information_schema.tables"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)

# keep only user tables, skipping the pg_catalog and information_schema system schemas
table_name_list = df.filter(
    (df["table_schema"] != 'pg_catalog') & (df["table_schema"] != 'information_schema')
).select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)

That should work.
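(A minimal alternative sketch, not from the original answer: the same schema filter can be pushed down to Postgres as a subquery so that only the relevant table names are transferred; the alias tbls is arbitrary.)

# hypothetical variant: filter the system schemas inside the query itself
query = """(SELECT table_name
            FROM information_schema.tables
            WHERE table_schema NOT IN ('pg_catalog', 'information_schema')) tbls"""
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = [row.table_name for row in df.collect()]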



Source: https://stackoverflow.com/questions/50613977/unsupported-array-error-when-reading-jdbc-source-in-pyspark
