Inserting Analytic data from Spark to Postgres

后端 未结 4 1976
天涯浪人
天涯浪人 2020-12-30 12:56

I have Cassandra database from which i analyzed the data using SparkSQL through Apache Spark. Now i want to insert those analyzed data into PostgreSQL . Is there any ways to

相关标签:
4条回答
  • 2020-12-30 13:53

    You can use Postgres copy api to write it, its much faster that way. See following two methods - one iterates over RDD to fill the buffer that can be saved by copy api. Only thing you have to take care of is creating correct statement in csv format that will be used by copy api.

    def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
            val sb = mutable.StringBuilder.newBuilder
            val now = System.currentTimeMillis()
    
            rdd.collect().foreach(itr => {
                itr.foreach(_.createCSV(sb, now).append("\n"))
            })
    
            copyIn("myTable",  new StringReader(sb.toString), "statement")
            sb.clear
        }
    
    
    def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = "") = {
            val conn = connectionPool.getConnection()
            try {
                conn.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
            } catch {
                case se: SQLException => logWarning(se.getMessage)
                case t: Throwable => logWarning(t.getMessage)
            } finally {
                conn.close()
            }
        }
    
    0 讨论(0)
  • 2020-12-30 13:54

    Answer by 0x0FFF is good. Here is an additional point that would be useful.

    I use foreachPartition to persist to external store. This is also inline with the design pattern Design Patterns for using foreachRDD given in Spark documentation https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams

    Example:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
      }
    }
    
    0 讨论(0)
  • 2020-12-30 13:58

    The answers above refers to old spark versions, in spark 2.* there is jdbc connector, enable write directly to RDBS from a dataFrame.

    example:

    jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
              properties={"user": "username", "password": "password"})
    

    https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

    0 讨论(0)
  • 2020-12-30 14:00

    At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions in the Spark user list: one, two

    In general, the most performant approach would be the following:

    1. Validate the number of partitions in RDD, it should not be too low and too high. 20-50 partitions should be fine, if the number is lower - call repartition with 20 partitions, if higher - call coalesce to 50 partitions
    2. Call the mapPartition transformation, inside of it call the function to insert the records to your DBMS using JDBC. In this function you open the connection to your database and use the COPY command with this API, it would allow you to eliminate the need for a separate command for each record - this way the insert would be processed much faster

    This way you would insert the data into Postgres in a parallel fashion utilizing up to 50 parallel connection (depends on your Spark cluster size and its configuration). The whole approach might be implemented as a Java/Scala function accepting the RDD and the connection string

    0 讨论(0)
提交回复
热议问题