I want to use Apache Spark and connect to Vertica by JDBC.
In the Vertica database I have 100 million records, and the Spark code runs on another server.
When I run
After your Spark job finishes, log on to the Vertica database using the same credentials the Spark job used and run:
SELECT * FROM v_monitor.query_requests ORDER BY start_timestamp DESC LIMIT 10000;
This will show you the queries the Spark job sent to the database, so you can see whether it pushed the count(*) down to the database or whether it actually tried to pull the entire table across the network.
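For context, the Spark side of that check is just a plain JDBC read plus a count; a minimal sketch, assuming placeholder host, database, table and credentials:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vertica-count").getOrCreate()

// Placeholder Vertica connection details -- replace with your own.
val verticaUrl = "jdbc:vertica://vertica-host:5433/mydb"
val verticaProps = new java.util.Properties
verticaProps.setProperty("user", "dbuser")
verticaProps.setProperty("password", "dbpassword")
verticaProps.setProperty("driver", "com.vertica.jdbc.Driver")

// Read the table over JDBC and count it; v_monitor.query_requests will then
// show the SQL that Spark actually sent to Vertica for this count.
val df = spark.read.jdbc(verticaUrl, "my_table", verticaProps)
println(df.count())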
JDBC-based databases allow push-down queries, so only the relevant rows are read from disk. For example, with df.filter("user_id == 2").count the filter is pushed down: the database selects only the matching records and ships them to Spark, which then computes the count. So when using JDBC: 1. plan for filters, 2. partition your database according to your query patterns, and optimise further from the Spark side, for example:
val prop = new java.util.Properties
// Example with the PostgreSQL JDBC driver; swap in your database's driver class as needed.
prop.setProperty("driver", "org.postgresql.Driver")
// Column used to split the table into parallel JDBC partitions.
prop.setProperty("partitionColumn", "user_id")
// Value range of partitionColumn to spread across the partitions.
prop.setProperty("lowerBound", "1")
prop.setProperty("upperBound", "272")
// Number of concurrent JDBC reads.
prop.setProperty("numPartitions", "30")
However, most relational DBs are partitioned by specific fields in a tree-like structure, which is not ideal for complex big-data queries. I strongly suggest copying the table from the JDBC source to a NoSQL store such as Cassandra, MongoDB or Elasticsearch, or to a file system such as Alluxio or HDFS, to enable scalable, parallel, complex, fast queries. Lastly, you can replace JDBC with AWS Redshift, which should not be that hard to implement on the backend/frontend side; from the Spark side, however, it is a pain to deal with dependency conflicts. But it will let you run complex queries much faster, because Redshift stores data by column, so aggregates on those columns can be pushed down and executed by multiple workers.
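As one illustration of the "copy to a file system" route, a minimal sketch that reuses the url and prop from the snippet above and dumps the table once to Parquet on HDFS (table name and path are placeholders), so later analytical queries hit the columnar copy instead of the JDBC source:

// One-off copy: read over JDBC, write to Parquet on HDFS.
val jdbcDf = spark.read.jdbc(url, "events", prop)
jdbcDf.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

// Subsequent queries scan the Parquet copy in parallel.
val events = spark.read.parquet("hdfs:///data/events_parquet")
events.filter("user_id == 2").count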