What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically: depending on the size of the given list of values, when should one use isin versus a join?
Considering

    import pyspark.sql.functions as psf

there are two types of broadcasting:

- sc.broadcast() copies a Python object to every node, for a more efficient use of psf.isin.
- psf.broadcast, used inside a join, copies your pyspark dataframe to every node when that dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).

In the context of the question, the filtering was done using the column of another dataframe, hence the possible solution with a join.
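A minimal sketch of both variants, assuming a pyspark shell where sc and spark already exist, a dataframe df with an id column, and a Python list values to filter on (all of these names are hypothetical):

    import pyspark.sql.functions as psf

    values = ["a", "b", "c"]  # hypothetical filtering list

    # 1) sc.broadcast: ship the Python list to every node, then filter with isin
    b_values = sc.broadcast(values)
    filtered_isin = df.filter(psf.col("id").isin(b_values.value))

    # 2) psf.broadcast: turn the list into a small dataframe and broadcast it in a join
    df_values = spark.createDataFrame([(v,) for v in values], ["id"])
    filtered_join = df.join(psf.broadcast(df_values), on="id")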
Keep in mind that if your filtering list is relatively big, searching through it will take a while, and since it has to be done for each row it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.
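If you only need the join for filtering, a left semi join keeps the rows of the large dataframe whose key appears in the other one without pulling in its columns; a small sketch, reusing the hypothetical names from above:

    # keep rows of df whose id also appears in df_values, without adding any columns
    filtered_semi = df.join(psf.broadcast(df_values), on="id", how="left_semi")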