DataFrame filtering based on a second DataFrame

Posted by 前提是你 on 2020-01-13 14:05:04

Question


Using Spark SQL, I have two dataframes, they are created from one, such as:

df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]

I want to filter df1 so that a row is kept only if a shortened version of its 'path' matches some path in df2. For example, if df1 has a row with path 'a/b/c/d/e', I want to check whether df2 contains a row whose path is 'a/b/c'. In SQL it would look like

SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)

where udf is a user-defined function that shortens the original path from df1. The naive solution is to use a JOIN and then filter the result, but that is slow, since df1 and df2 each have more than 10 million rows.

I also tried the following code, but first I had to create a broadcast variable from df2:

static Broadcast<DataFrame> bdf;
bdf = sc.broadcast(df2); // variable 'sc' is JavaSparkContext

sqlContext.createDataFrame(df1.javaRDD().filter(
    new Function<Row, Boolean>() {
        @Override
        public Boolean call(Row row) throws Exception {
            String foo = shortenPath(row.getString(0));
            return bdf.value().filter("path = '" + foo + "'").count() > 0;
        }
    }
), myClass.class);

The problem I'm having is that Spark gets stuck when the return value is evaluated, i.e. when the filtering of df2 is performed.

I would like to know how to work with two dataframes to do this. I really want to avoid JOIN. Any ideas?


EDIT>>

In my original code, df1 has the alias 'first' and df2 the alias 'second'. This join is not Cartesian, and it also does not use broadcast.

df1 = df1.as("first");
df2 = df2.as("second");

df1.join(df2, df1.col("first.path").lt(df2.col("second.path")), "left_outer")
   .filter("isPrefix(first.path, second.path)")
   .na().drop("any");

isPrefix is a UDF:

UDF2 isPrefix = new UDF2<String, String, Boolean>() {
        @Override
        public Boolean call(String p, String s) throws Exception {
            // true if p is four characters shorter than s and occurs anywhere in s
            return p.length() + 4 == s.length() && s.contains(p);
        }};
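For clarity, here is the same check as a standalone plain-Java sketch (the class name is mine). It also demonstrates the over-match that the first answer flags: contains() accepts p anywhere in s, not only as a prefix.

```java
public class IsPrefixSketch {
    // Mirrors the UDF logic: s must be exactly four characters longer
    // than p (two extra single-character segments such as "/b/c"),
    // and p must occur somewhere in s.
    static boolean isPrefix(String p, String s) {
        return p.length() + 4 == s.length() && s.contains(p);
    }

    public static void main(String[] args) {
        System.out.println(isPrefix("a/a/a", "a/a/a/b/c")); // true
        // contains() also matches p as a suffix, which is arguably wrong:
        System.out.println(isPrefix("a/b/c", "x/a/a/b/c")); // true
    }
}
```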

shortenPath is a UDF that drops the last two segments of a path:

UDF1 shortenPath = new UDF1<String, String>() {
        @Override
        public String call(String s) throws Exception {
            String[] foo = s.split("/");
            String result = "";
            for (int i = 0; i < foo.length - 2; i++) {
                result += foo[i];
                if (i < foo.length - 3) result += "/";
            }
            return result;
        }
    };
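The same segment-dropping logic can be written more compactly as a plain static method (a sketch outside the Spark UDF wrapper; the class name is mine):

```java
import java.util.Arrays;

public class ShortenPathSketch {
    // Drops the last two "/"-separated segments of a path.
    static String shortenPath(String s) {
        String[] parts = s.split("/");
        if (parts.length <= 2) return ""; // nothing left after dropping two segments
        return String.join("/", Arrays.copyOf(parts, parts.length - 2));
    }

    public static void main(String[] args) {
        System.out.println(shortenPath("a/b/c/d/e")); // a/b/c
        System.out.println(shortenPath("a/a/a/b/c")); // a/a/a
    }
}
```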

Some example records (path is unique):

a/a/a/b/c abc
a/a/a     qwe
a/b/c/d/e abc
a/b/c     qwe
a/b/b/k   foo
a/b/f/a   bar
...

So df1 consists of

a/a/a/b/c abc
a/b/c/d/e abc
...

and df2 consists of

a/a/a     qwe
a/b/c     qwe
...

Answer 1:


There are at least a few problems with your code:

  • you cannot execute an action or transformation inside another action or transformation. It means that filtering a broadcasted DataFrame simply cannot work and you should get an exception.
  • the join you use is executed as a Cartesian product followed by a filter. Since Spark uses hashing for joins, only equality-based joins can be executed efficiently without a Cartesian product. It is somewhat related to Why using a UDF in a SQL query leads to cartesian product?
  • if both DataFrames are relatively large and of similar size, then broadcasting is unlikely to be useful. See Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark
  • not important when it comes to performance, but isPrefix seems to be wrong. In particular, it looks like it can match both prefixes and suffixes
  • the col("first.path").lt(col("second.path")) condition looks wrong. I assume you want a/a/a/b/c from df1 to match a/a/a from df2. If so, it should be gt, not lt.

Probably the best thing you can do is something similar to this:

import org.apache.spark.sql.functions.{col, regexp_extract}

val df = sc.parallelize(Seq(
    ("a/a/a/b/c", "abc"), ("a/a/a","qwe"),
    ("a/b/c/d/e", "abc"), ("a/b/c", "qwe"),
    ("a/b/b/k", "foo"), ("a/b/f/a", "bar")
)).toDF("path", "value")

val df1 = df
    .where(col("value") === "abc")    
    .withColumn("path_short", regexp_extract(col("path"), "^(.*)(/.){2}$", 1))
    .as("df1")

val df2 = df.where(col("value") === "qwe").as("df2")
val joined = df1.join(df2, col("df1.path_short") === col("df2.path"))
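The regexp_extract pattern above can be checked outside Spark with plain java.util.regex; group 1 is the shortened path. Note the pattern assumes the last two segments are single characters ("/.") — a sketch (class and method names are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathRegexCheck {
    // Same pattern as the regexp_extract call: strip two trailing
    // single-character segments and capture the rest as group 1.
    private static final Pattern SHORTEN = Pattern.compile("^(.*)(/.){2}$");

    // Returns group 1, or "" when the path does not end in two
    // single-character segments (mirroring regexp_extract's behavior
    // of returning an empty string on no match).
    static String shorten(String path) {
        Matcher m = SHORTEN.matcher(path);
        return m.matches() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(shorten("a/a/a/b/c")); // a/a/a
        System.out.println(shorten("a/b/c/d/e")); // a/b/c
    }
}
```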

You can try to broadcast one of the tables like this (Spark >= 1.5.0 only):

import org.apache.spark.sql.functions.broadcast

df1.join(broadcast(df2), col("df1.path_short") === col("df2.path"))

and increase the auto-broadcast limits, but as I've mentioned above, it will most likely be less efficient than a plain HashJoin.




Answer 2:


As a possible way of implementing IN with a subquery, a LEFT SEMI JOIN can be used:

    JavaSparkContext javaSparkContext = new JavaSparkContext("local", "testApp");
    SQLContext sqlContext = new SQLContext(javaSparkContext);
    StructType schema = DataTypes.createStructType(new StructField[]{
            DataTypes.createStructField("path", DataTypes.StringType, false),
            DataTypes.createStructField("value", DataTypes.StringType, false)
    });
    // Prepare First DataFrame
    List<Row> dataForFirstDF = new ArrayList<>();
    dataForFirstDF.add(RowFactory.create("a/a/a/b/c", "abc"));
    dataForFirstDF.add(RowFactory.create("a/b/c/d/e", "abc"));
    dataForFirstDF.add(RowFactory.create("x/y/z", "xyz"));
    DataFrame df1 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForFirstDF), schema);
    // 
    df1.show();
    //
    // +---------+-----+
    // |     path|value|
    // +---------+-----+
    // |a/a/a/b/c|  abc|
    // |a/b/c/d/e|  abc|
    // |    x/y/z|  xyz|
    // +---------+-----+

    // Prepare Second DataFrame
    List<Row> dataForSecondDF = new ArrayList<>();
    dataForSecondDF.add(RowFactory.create("a/a/a", "qwe"));
    dataForSecondDF.add(RowFactory.create("a/b/c", "qwe"));
    DataFrame df2 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForSecondDF), schema);

    // Use left semi join to filter out df1 based on path in df2
    Column pathContains = functions.column("firstDF.path").contains(functions.column("secondDF.path"));
    DataFrame result = df1.as("firstDF").join(df2.as("secondDF"), pathContains, "leftsemi");

    //
    result.show();
    //
    // +---------+-----+
    // |     path|value|
    // +---------+-----+
    // |a/a/a/b/c|  abc|
    // |a/b/c/d/e|  abc|
    // +---------+-----+
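One caveat with the condition above: Column.contains uses plain substring matching, so a df1 path like 'a/b/cd' would also match the df2 path 'a/b/c' even though it is not under that directory. In plain-Java terms (a sketch of the same string semantics, not Spark code):

```java
public class ContainsCaveat {
    public static void main(String[] args) {
        // Substring matching does not respect segment boundaries:
        System.out.println("a/b/cd".contains("a/b/c"));        // true, but not a path prefix
        // A boundary-aware alternative requires a trailing separator:
        System.out.println("a/b/cd".startsWith("a/b/c/"));     // false
        System.out.println("a/b/c/d/e".startsWith("a/b/c/"));  // true
    }
}
```

Whether this matters depends on the data; with the example records it happens to be harmless.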

The physical plan of such a query will look like this:

== Physical Plan ==
Limit 21
 ConvertToSafe
  LeftSemiJoinBNL Some(Contains(path#0, path#2))
   ConvertToUnsafe
    Scan PhysicalRDD[path#0,value#1]
   TungstenProject [path#2]
    Scan PhysicalRDD[path#2,value#3]

It will use LeftSemiJoinBNL for the actual join operation, which should broadcast values internally. For more details, refer to the actual implementation in Spark - LeftSemiJoinBNL.scala

P.S. I didn't quite understand the need for removing the last two path segments, but if that's needed, it can be done as @zero323 proposed (using regexp_extract).



来源:https://stackoverflow.com/questions/34302547/dataframe-filtering-based-on-second-dataframe
