问题
Using Spark SQL, I have two dataframes, they are created from one, such as:
df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]
I want to filter df1, if part of its 'path' is any path in df2. So if df1 has row with path 'a/b/c/d/e' I would find out if in df2 is a row that path is 'a/b/c'. In SQL it should be like
SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)
where udf is user defined function that shorten original path from df1. Naive solution is to use JOIN and then filter result, but it is slow, since df1 and df2 have each more than 10mil rows.
I also tried following code, but firstly I had to create broadcast variable from df2
static Broadcast<DataFrame> bdf;
bdf = sc.broadcast(df2); //variable 'sc' is JavaSparkContext
sqlContext.createDataFrame(df1.javaRDD().filter(
new Function<Row, Boolean>(){
@Override
public Boolean call(Row row) throws Exception {
String foo = shortenPath(row.getString(0));
return bdf.value().filter("path = '"+foo+"'").count()>0;
}
}
), myClass.class)
the problem I'm having is that Spark got stuck when the return was evaluated/when filtering of df2 was performed.
I would like to know how to work with two dataframes to do this. I really want to avoid JOIN. Any ideas?
EDIT>>
In my original code df1 has alias 'first' and df2 'second'. This join is not cartesian, and it also does not use broadcast.
df1 = df1.as("first");
df2 = df2.as("second");
df1.join(df2, df1.col("first.path").
lt(df2.col("second.path"))
, "left_outer").
filter("isPrefix(first.path, second.path)").
na().drop("any");
isPrefix is udf
UDF2 isPrefix = new UDF2<String, String, Boolean>() {
@Override
public Boolean call(String p, String s) throws Exception {
//return true if (p.length()+4==s.length()) and s.contains(p)
}};
shortenPath - it cuts last two characters in path
UDF1 shortenPath = new UDF1<String, String>() {
@Override
public String call(String s) throws Exception {
String[] foo = s.split("/");
String result = "";
for (int i = 0; i < foo.length-2; i++) {
result += foo[i];
if(i<foo.length-3) result+="/";
}
return result;
}
};
Example of records. Path is unique.
a/a/a/b/c abc
a/a/a qwe
a/b/c/d/e abc
a/b/c qwe
a/b/b/k foo
a/b/f/a bar
...
So df1 consits of
a/a/a/b/c abc
a/b/c/d/e abc
...
and df2 consits of
a/a/a qwe
a/b/c qwe
...
回答1:
There at least few problems with your code:
- you cannot execute action or transformation inside another action or transformation. It means that filtering broadcasted
DataFrame
simply cannot work and you should get an exception. join
you use is executed as a Cartesian product followed by filter. Since Spark is usingHashing
for joins only equality based joins can be efficiently executed without Cartesian. It is slightly related to Why using a UDF in a SQL query leads to cartesian product?- if both
DataFrames
are relatively large and have similar size then broadcasting is unlikely to be useful. See Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark - not important when it comes to performance but
isPrefix
seems to wrong. In particular it looks like it can match both prefix and suffix col("first.path").lt(col("second.path"))
condition looks wrong. I assume you wanta/a/a/b/c
fromdf1
matcha/a/a
fromdf2
. If so it should begt
notlt
.
Probably the best thing you can do is something similar to this:
import org.apache.spark.sql.functions.{col, regexp_extract}
val df = sc.parallelize(Seq(
("a/a/a/b/c", "abc"), ("a/a/a","qwe"),
("a/b/c/d/e", "abc"), ("a/b/c", "qwe"),
("a/b/b/k", "foo"), ("a/b/f/a", "bar")
)).toDF("path", "value")
val df1 = df
.where(col("value") === "abc")
.withColumn("path_short", regexp_extract(col("path"), "^(.*)(/.){2}$", 1))
.as("df1")
val df2 = df.where(col("value") === "qwe").as("df2")
val joined = df1.join(df2, col("df1.path_short") === col("df2.path"))
You can try to broadcast one of the tables like this (Spark >= 1.5.0 only):
import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), col("df1.path_short") === col("df2.path"))
and increase auto broadcast limits, but as I've mentioned above it most likely will be less efficient than plain HashJoin
.
回答2:
As a possible way of implementing IN
with subquery, the LEFT SEMI JOIN
can be used:
JavaSparkContext javaSparkContext = new JavaSparkContext("local", "testApp");
SQLContext sqlContext = new SQLContext(javaSparkContext);
StructType schema = DataTypes.createStructType(new StructField[]{
DataTypes.createStructField("path", DataTypes.StringType, false),
DataTypes.createStructField("value", DataTypes.StringType, false)
});
// Prepare First DataFrame
List<Row> dataForFirstDF = new ArrayList<>();
dataForFirstDF.add(RowFactory.create("a/a/a/b/c", "abc"));
dataForFirstDF.add(RowFactory.create("a/b/c/d/e", "abc"));
dataForFirstDF.add(RowFactory.create("x/y/z", "xyz"));
DataFrame df1 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForFirstDF), schema);
//
df1.show();
//
// +---------+-----+
// | path|value|
// +---------+-----+
// |a/a/a/b/c| abc|
// |a/b/c/d/e| abc|
// | x/y/z| xyz|
// +---------+-----+
// Prepare Second DataFrame
List<Row> dataForSecondDF = new ArrayList<>();
dataForSecondDF.add(RowFactory.create("a/a/a", "qwe"));
dataForSecondDF.add(RowFactory.create("a/b/c", "qwe"));
DataFrame df2 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForSecondDF), schema);
// Use left semi join to filter out df1 based on path in df2
Column pathContains = functions.column("firstDF.path").contains(functions.column("secondDF.path"));
DataFrame result = df1.as("firstDF").join(df2.as("secondDF"), pathContains, "leftsemi");
//
result.show();
//
// +---------+-----+
// | path|value|
// +---------+-----+
// |a/a/a/b/c| abc|
// |a/b/c/d/e| abc|
// +---------+-----+
The Physical Plan of such query will look like this:
== Physical Plan ==
Limit 21
ConvertToSafe
LeftSemiJoinBNL Some(Contains(path#0, path#2))
ConvertToUnsafe
Scan PhysicalRDD[path#0,value#1]
TungstenProject [path#2]
Scan PhysicalRDD[path#2,value#3]
It will use the LeftSemiJoinBNL for the actual join operation, which should broadcast values internally. From more details refer to the actual implementation in Spark - LeftSemiJoinBNL.scala
P.S. I didn't quite understand the need for removing the last two characters, but if that's needed - it can be done, like @zero323 proposed (using regexp_extract
).
来源:https://stackoverflow.com/questions/34302547/dataframe-filtering-based-on-second-dataframe