Does Apache Spark load entire data from target database?

逝去的感伤 2021-01-16 08:13

I want to use Apache Spark and connect to Vertica over JDBC.

The Vertica database holds 100 million records, and the Spark code runs on another server.

When I run a query from Spark (for example a simple count), does Spark pull the entire table from Vertica across the network, or does it push the query down to the database?

2 Answers
  • 2021-01-16 08:54

    After your Spark job finishes, log on to the Vertica database using the same credentials the Spark job used and run:

    SELECT * FROM v_monitor.query_requests ORDER BY start_timestamp DESC LIMIT 10000;
    

    This will show you the queries the Spark job sent to the database, so you can see whether it pushed the count(*) down to Vertica or whether it did try to retrieve the entire table across the network.
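    A complementary check from the Spark side (a sketch only: it assumes a SparkSession named spark with the Vertica JDBC driver on the classpath, and the URL, table name, and credentials are placeholders). The physical plan lists any predicates Spark pushed into the JDBC query under PushedFilters:

    // Hypothetical connection details; only the pattern matters here
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:vertica://vertica-host:5433/mydb")
      .option("driver", "com.vertica.jdbc.Driver")
      .option("dbtable", "big_table")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()

    // A "PushedFilters: [IsNotNull(user_id), EqualTo(user_id,2)]" entry in
    // the plan means the filter runs inside the database rather than after
    // a full table transfer
    df.filter("user_id = 2").explain()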

  • 2021-01-16 09:01

    Spark's JDBC data source supports predicate pushdown, so only the relevant rows are read from disk. For example, df.filter("user_id == 2").count first pushes the filter down so the database selects only the matching records, then computes the count in Spark. So when using JDBC: 1. plan for filters, 2. partition your DB according to your query patterns, and further optimize from the Spark side, e.g.:

    // Assumes an existing SparkSession `spark`; Spark splits the read into
    // `numPartitions` parallel queries over ranges of `partitionColumn`
    val prop = new java.util.Properties
    prop.setProperty("driver", "org.postgresql.Driver")
    prop.setProperty("partitionColumn", "user_id")
    prop.setProperty("lowerBound", "1")
    prop.setProperty("upperBound", "272")
    prop.setProperty("numPartitions", "30")

    // URL and table name below are placeholders for your own connection
    val df = spark.read.jdbc("jdbc:postgresql://db-host:5432/mydb", "users", prop)
    

    However, most relational DBs are partitioned by specific fields in a tree-like structure, which is not ideal for complex big-data queries. I strongly suggest copying the table from JDBC to a NoSQL store such as Cassandra, MongoDB, or Elasticsearch, or to a file system such as Alluxio or HDFS, to enable scalable, parallel, complex, fast queries (a sketch of the HDFS route follows below). Lastly, you can replace JDBC with AWS Redshift, which should not be that hard to adopt for the backend/frontend; from the Spark side it is a pain to deal with dependency conflicts, but it will let you run complex queries much faster, since its columnar partitioning allows aggregates to be pushed down onto the columns themselves across multiple workers.
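    A minimal sketch of the copy-out suggestion above, reusing the partitioned df from the previous snippet (the HDFS paths are placeholders):

    // One-time export: scan the table through JDBC once and persist it as
    // Parquet on HDFS; later Spark jobs read the copy in parallel without
    // touching the database
    df.write.mode("overwrite").parquet("hdfs:///warehouse/users_parquet")

    // Subsequent queries hit the Parquet copy instead of the JDBC source
    val users = spark.read.parquet("hdfs:///warehouse/users_parquet")
    println(users.filter("user_id == 2").count())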
