Regarding Spark Dataframereader jdbc

☆樱花仙子☆ 提交于 2020-01-06 04:02:57

问题


I have a question regarding Mechanics of Spark Dataframereader. I will appreciate if anybody can help me. Let me explain the Scenario here

I am creating a DataFrame from Dstream like this. This in Input Data

 var config = new HashMap[String,String]();
        config += ("zookeeper.connect" ->zookeeper);        
        config += ("partition.assignment.strategy" ->"roundrobin");
        config += ("bootstrap.servers" ->broker);
        config += ("serializer.class" -> "kafka.serializer.DefaultEncoder");
        config += ("group.id" -> "default"); 

        val lines =  KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc,config.toMap,Set(topic)).map(_._2)

        lines.foreachRDD { rdd =>

                if(!rdd.isEmpty()){

                    val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }       





                    val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)

                    val rddDF = sqlContext.read.json(rddJson)

                    rddDF.registerTempTable("inputData")



 val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD","A",numOfPartiton,lowerBound,upperBound)

Here is the code of ReadDataFrameHelper

def readDataFrameHelperFromDB(sqlContext:HiveContext,jdbcUrl:String,dbTableOrQuery:String,
            columnToPartition:String,numOfPartiton:Int,lowerBound:Int,highBound:Int):DataFrame={

        val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
                columnName = columnToPartition,
                lowerBound = lowerBound,
                upperBound = highBound,
                numPartitions = numOfPartiton,
                connectionProperties = new java.util.Properties()
                )

            jdbcDF  

    }

Lastly i am doing a Join like this

 val joinedData = rddDF.join(dbDF,rddDF("ID") === dbDF("ID")
                                 && rddDF("CODE") === dbDF("CODE"),"left_outer")
                        .drop(dbDF("code"))
                        .drop(dbDF("id"))
                        .drop(dbDF("number"))
                        .drop(dbDF("key"))
                        .drop(dbDF("loaddate"))
                        .drop(dbDF("fid"))
joinedData.show()

My input DStream will have 1000 rows and data will contains million of rows. So when i do this join, will spark load all the rows from database and read those rows or will this just read the those specific rows from DB which have the code,id from the input DStream


回答1:


As specified by zero323, i have also confirmed that data will be read full from the table. I checked the DB session logs and saw that whole dataset is getting loaded.

Thanks zero323



来源:https://stackoverflow.com/questions/37468418/regarding-spark-dataframereader-jdbc

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!