More than one hour to execute pyspark.sql.DataFrame.take(4)

慢半拍i · 2020-11-28 13:04

I am running Spark 1.6 on 3 VMs (1x master, 2x slaves), all with 4 cores and 16GB of RAM.

I can see the workers registered in the Spark master web UI.

I want to fetch just a few rows from a database table over JDBC, but df.take(4) takes more than an hour to complete.
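
A minimal sketch of the kind of setup involved (the master URL, JDBC URL, table name, and driver below are placeholders, not details from the question):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Spark 1.6-era entry points; the master URL is a placeholder
    conf = SparkConf().setAppName('take-demo').setMaster('spark://master:7077')
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Load a table over JDBC (all connection options are placeholders)
    df = (sqlContext.read.format('jdbc')
          .options(url='jdbc:postgresql://host/db',
                   dbtable='xxx',
                   driver='org.postgresql.Driver')
          .load())

    df.take(4)  # unexpectedly takes over an hour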

1 Answer
  • 2020-11-28 13:31

    While Spark supports limited predicate pushdown over JDBC, all other operations, such as limit, group, and aggregations, are performed internally. Unfortunately, this means that take(4) will fetch the data first and then apply the limit. In other words, your database will execute (assuming no projections and filters) something equivalent to:

    SELECT * FROM table 
    

    and the rest will be handled by Spark. There are some optimizations involved (in particular, Spark evaluates partitions iteratively to obtain the number of records requested by LIMIT), but it is still quite an inefficient process compared to database-side optimizations.
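
    To see the difference for yourself, you can inspect the physical plan. A minimal sketch, assuming placeholder connection details and a hypothetical id column:

    df = (sqlContext.read.format('jdbc')
          .options(url='jdbc:postgresql://host/db',  # placeholder URL
                   dbtable='xxx')
          .load())

    # The JDBC scan node in the plan reports the pushed-down filter
    df.filter(df['id'] > 100).explain()

    # There is no such pushdown for limit: take(4) scans, then truncates in Spark
    df.take(4)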

    If you want to push the limit down to the database, you'll have to do it statically, using a subquery as the dbtable parameter:

    (sqlContext.read.format('jdbc')
        .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ....)
        .load())

    or, equivalently, in Scala:

    sqlContext.read.format("jdbc").options(Map(
      "url"     -> "xxxx",
      "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp"
    )).load()
    

    Please note that the alias in the subquery (tmp above) is mandatory.
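
    With the limit in the subquery, the database itself returns at most four rows. A minimal usage sketch, again with placeholder connection details:

    df = (sqlContext.read.format('jdbc')
          .options(url='jdbc:postgresql://host/db',            # placeholder URL
                   dbtable='(SELECT * FROM xxx LIMIT 4) tmp')  # alias tmp is required
          .load())
    df.take(4)  # cheap now: only the pre-limited rows are fetched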

    Note:

    This behavior may be improved in the future, once Data Source API v2 is ready:

    • SPARK-15689
    • SPIP: Data Source API V2