I am running Spark 1.6 on 3 VMs (i.e. 1x master, 2x slaves), all with 4 cores and 16 GB RAM.
I can see the workers registered on the Spark master web UI.
I want to fetch only the first few rows of a large database table over JDBC (e.g. with take(4)), without Spark pulling the whole table first.
While Spark supports limited predicate pushdown over JDBC, all other operations, such as limit, group, and aggregations, are performed internally. Unfortunately, this means that take(4) will fetch the data first and then apply the limit. In other words, your database will execute (assuming no projections and filters) something equivalent to:
SELECT * FROM table
and the rest will be handled by Spark. There are some optimizations involved (in particular, Spark evaluates partitions iteratively to obtain the number of records requested by LIMIT), but it is still quite an inefficient process compared to database-side optimizations.
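To make this concrete, here is a minimal PySpark sketch of the pattern described above. The connection details (URL, driver) are placeholders, and sqlContext is assumed to be available as usual:

df = (sqlContext.read.format('jdbc')
      .options(url='jdbc:postgresql://host/db',   # placeholder URL
               dbtable='xxx',
               driver='org.postgresql.Driver')    # placeholder driver
      .load())

df.take(4)  # the database still executes SELECT * FROM xxx; the limit is applied by Spark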
If you want to push LIMIT down to the database, you'll have to do it statically, using a subquery as the dbtable parameter:
In Python:

(sqlContext.read.format('jdbc')
    .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ...))

or in Scala:

sqlContext.read.format("jdbc").options(Map(
  "url" -> "xxxx",
  "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp"
))
Please note that the alias in the subquery is mandatory.
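For completeness, a runnable PySpark sketch of the same idea, with .load() added and placeholder connection options (not from the snippets above):

df = (sqlContext.read.format('jdbc')
      .options(url='jdbc:postgresql://host/db',            # placeholder URL
               dbtable='(SELECT * FROM xxx LIMIT 4) tmp',  # limit pushed to the database; alias required
               driver='org.postgresql.Driver')             # placeholder driver
      .load())

df.take(4)  # now the database returns at most 4 rows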
Note:
This behavior may be improved in the future, once the Data Source API v2 is ready.