Implementing a recursive algorithm in pyspark to find pairings within a dataframe


I have a Spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each prof…

Answer:

    Edit: As discussed in the comments, to fix the issue mentioned in your update, we can convert student_id at each time into a generalized sequence id using dense_rank, go through Step-1 to Step-3 (using the student column), and then use a join to convert student at each time back to the original student_id; see Step-0 and Step-4 below. In case there are fewer than 4 professors in a timeUnit, the matrix is padded back to 4x4 on the NumPy side (using np.vstack() and np.zeros()); see the updated function find_assigned and the padding sketch right below.
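
    A minimal NumPy-only sketch of that padding (variable names mirror those in the updated find_assigned below; the values are just the negated sample scores with one professor dropped):

    import numpy as np

    sz = 4                                        # target square size: 4 students
    # e.g. only 3 professors present in this timeUnit; columns always cover the 4 students
    n1 = np.array([[-0.7, -0.5, -0.3 , -0.2],
                   [-0.9, -0.1, -0.15, -0.2],
                   [-0.2, -0.3, -0.4 , -0.8]])
    n = n1.shape[0]                               # number of professors actually present
    padded = np.vstack((n1, np.zeros((sz - n, sz))))   # pad with zero rows back to 4x4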

    You can try pandas_udf and scipy.optimize.linear_sum_assignment (note: the backend method is the Hungarian algorithm, as mentioned by @cronoik in the main comments); see below:

    from pyspark.sql.functions import pandas_udf, PandasUDFType, first, expr, dense_rank
    from pyspark.sql.types import StructType
    from scipy.optimize import linear_sum_assignment
    from pyspark.sql import Window
    import numpy as np
    
    df = spark.createDataFrame([
        ('1596048041', 'p1', 's1', 0.7), ('1596048041', 'p1', 's2', 0.5), ('1596048041', 'p1', 's3', 0.3),
        ('1596048041', 'p1', 's4', 0.2), ('1596048041', 'p2', 's1', 0.9), ('1596048041', 'p2', 's2', 0.1),
        ('1596048041', 'p2', 's3', 0.15), ('1596048041', 'p2', 's4', 0.2), ('1596048041', 'p3', 's1', 0.2),
        ('1596048041', 'p3', 's2', 0.3), ('1596048041', 'p3', 's3', 0.4), ('1596048041', 'p3', 's4', 0.8),
        ('1596048041', 'p4', 's1', 0.2), ('1596048041', 'p4', 's2', 0.3), ('1596048041', 'p4', 's3', 0.35),
        ('1596048041', 'p4', 's4', 0.4)
    ] , ['time', 'professor_id', 'student_id', 'score'])
    
    N = 4
    cols_student = [*range(1,N+1)]
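
    As a quick standalone check, this is what scipy.optimize.linear_sum_assignment computes for the sample score matrix above (the same thing the pandas_udf in Step-2 does per timestamp):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # scores from the sample data: rows = p1..p4, columns = students 1..4
    scores = np.array([[0.7, 0.5, 0.3 , 0.2 ],
                       [0.9, 0.1, 0.15, 0.2 ],
                       [0.2, 0.3, 0.4 , 0.8 ],
                       [0.2, 0.3, 0.35, 0.4 ]])

    # minimizing the negated scores maximizes the total score
    row_ind, col_ind = linear_sum_assignment(-scores)
    print(col_ind + 1)                        # [2 1 4 3] -> p1->2, p2->1, p3->4, p4->3
    print(scores[row_ind, col_ind].sum())     # total score of the optimal pairing: 2.55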
    

    Step-0: add an extra column student, and create a new dataframe df3 with all unique combos of time + student_id + student.

    w1 = Window.partitionBy('time').orderBy('student_id')
    
    df = df.withColumn('student', dense_rank().over(w1))
    +----------+------------+----------+-----+-------+                              
    |      time|professor_id|student_id|score|student|
    +----------+------------+----------+-----+-------+
    |1596048041|          p1|        s1|  0.7|      1|
    |1596048041|          p2|        s1|  0.9|      1|
    |1596048041|          p3|        s1|  0.2|      1|
    |1596048041|          p4|        s1|  0.2|      1|
    |1596048041|          p1|        s2|  0.5|      2|
    |1596048041|          p2|        s2|  0.1|      2|
    |1596048041|          p3|        s2|  0.3|      2|
    |1596048041|          p4|        s2|  0.3|      2|
    |1596048041|          p1|        s3|  0.3|      3|
    |1596048041|          p2|        s3| 0.15|      3|
    |1596048041|          p3|        s3|  0.4|      3|
    |1596048041|          p4|        s3| 0.35|      3|
    |1596048041|          p1|        s4|  0.2|      4|
    |1596048041|          p2|        s4|  0.2|      4|
    |1596048041|          p3|        s4|  0.8|      4|
    |1596048041|          p4|        s4|  0.4|      4|
    +----------+------------+----------+-----+-------+
    
    df3 = df.select('time','student_id','student').dropDuplicates()
    +----------+----------+-------+                                                 
    |      time|student_id|student|
    +----------+----------+-------+
    |1596048041|        s1|      1|
    |1596048041|        s2|      2|
    |1596048041|        s3|      3|
    |1596048041|        s4|      4|
    +----------+----------+-------+
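
    A quick check (with hypothetical data) that dense_rank re-numbers students per timestamp, so a different set of student_ids at another time still maps to 1..N:

    df_check = spark.createDataFrame(
        [('t1', 's1'), ('t1', 's2'), ('t2', 's7'), ('t2', 's9')],
        ['time', 'student_id'])
    df_check.withColumn('student', dense_rank().over(
        Window.partitionBy('time').orderBy('student_id'))).show()
    # t1: s1->1, s2->2;  t2: s7->1, s9->2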
    

    Step-1: use pivot to find the matrix of professors vs students. Notice that we set the negative of the scores as the pivot values, so that scipy.optimize.linear_sum_assignment can solve this as a minimum-cost assignment problem:

    df1 = df.groupby('time','professor_id').pivot('student', cols_student).agg(-first('score'))
    +----------+------------+----+----+-----+----+
    |      time|professor_id|   1|   2|    3|   4|
    +----------+------------+----+----+-----+----+
    |1596048041|          p4|-0.2|-0.3|-0.35|-0.4|
    |1596048041|          p2|-0.9|-0.1|-0.15|-0.2|
    |1596048041|          p1|-0.7|-0.5| -0.3|-0.2|
    |1596048041|          p3|-0.2|-0.3| -0.4|-0.8|
    +----------+------------+----+----+-----+----+
    

    Step-2: use pandas_udf and scipy.optimize.linear_sum_assignment to get column indices and then assign the corresponding column name to a new column assigned:

    # the return schema contains one more StringType column `assigned` than the schema of the input pdf:
    schema = StructType.fromJson(df1.schema.jsonValue()).add('assigned', 'string')
    
    # since the number of students is always N, we can use np.vstack to build the N*N matrix
    # below, `n` is the number of professors/rows in pdf
    # sz is the size of the padded square matrix, sz=4 in this example
    def __find_assigned(pdf, sz):
      cols = pdf.columns[2:]
      n = pdf.shape[0]
      n1 = pdf.iloc[:,2:].fillna(0).values
      _, idx = linear_sum_assignment(np.vstack((n1,np.zeros((sz-n,sz)))))
      return pdf.assign(assigned=[cols[i] for i in idx][:n])
    
    find_assigned = pandas_udf(lambda x: __find_assigned(x,N), schema, PandasUDFType.GROUPED_MAP)
    
    df2 = df1.groupby('time').apply(find_assigned)
    +----------+------------+----+----+-----+----+--------+
    |      time|professor_id|   1|   2|    3|   4|assigned|
    +----------+------------+----+----+-----+----+--------+
    |1596048041|          p4|-0.2|-0.3|-0.35|-0.4|       3|
    |1596048041|          p2|-0.9|-0.1|-0.15|-0.2|       1|
    |1596048041|          p1|-0.7|-0.5| -0.3|-0.2|       2|
    |1596048041|          p3|-0.2|-0.3| -0.4|-0.8|       4|
    +----------+------------+----+----+-----+----+--------+
    

    Note: per the suggestion from @OluwafemiSule, we can use the parameter maximize instead of negating the score values. This parameter is available in SciPy 1.4.0+:

      _, idx = linear_sum_assignment(np.vstack((n1,np.zeros((N-n,N)))), maximize=True)
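
    Note: the PandasUDFType.GROUPED_MAP style used above is deprecated since Spark 3.0; on Spark 3.x the same step can be written with groupBy().applyInPandas() (a minimal sketch reusing the schema and __find_assigned defined above):

      # assuming Spark 3.x; equivalent to df1.groupby('time').apply(find_assigned)
      df2 = df1.groupby('time').applyInPandas(lambda pdf: __find_assigned(pdf, N), schema)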
    

    Step-3: use the Spark SQL stack function to unpivot the above df2 back to long format, negate the score values back to positive, and filter out rows whose score is NULL. The desired is_match column should have assigned==student:

    df_new = df2.selectExpr(
      'time',
      'professor_id',
      'assigned',
      'stack({},{}) as (student, score)'.format(len(cols_student), ','.join("int('{0}'), -`{0}`".format(c) for c in cols_student))
    ) \
    .filter("score is not NULL") \
    .withColumn('is_match', expr("assigned=student"))
    
    df_new.show()
    +----------+------------+--------+-------+-----+--------+
    |      time|professor_id|assigned|student|score|is_match|
    +----------+------------+--------+-------+-----+--------+
    |1596048041|          p4|       3|      1|  0.2|   false|
    |1596048041|          p4|       3|      2|  0.3|   false|
    |1596048041|          p4|       3|      3| 0.35|    true|
    |1596048041|          p4|       3|      4|  0.4|   false|
    |1596048041|          p2|       1|      1|  0.9|    true|
    |1596048041|          p2|       1|      2|  0.1|   false|
    |1596048041|          p2|       1|      3| 0.15|   false|
    |1596048041|          p2|       1|      4|  0.2|   false|
    |1596048041|          p1|       2|      1|  0.7|   false|
    |1596048041|          p1|       2|      2|  0.5|    true|
    |1596048041|          p1|       2|      3|  0.3|   false|
    |1596048041|          p1|       2|      4|  0.2|   false|
    |1596048041|          p3|       4|      1|  0.2|   false|
    |1596048041|          p3|       4|      2|  0.3|   false|
    |1596048041|          p3|       4|      3|  0.4|   false|
    |1596048041|          p3|       4|      4|  0.8|    true|
    +----------+------------+--------+-------+-----+--------+
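
    For reference, with N=4 the format string above expands to the following expression passed to selectExpr (the back-quoted names are the pivoted columns):

      stack(4,int('1'), -`1`,int('2'), -`2`,int('3'), -`3`,int('4'), -`4`) as (student, score)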
    

    Step-4: use join to convert student back to student_id (use broadcast join if possible):

    df_new = df_new.join(df3, on=["time", "student"])
    +----------+-------+------------+--------+-----+--------+----------+            
    |      time|student|professor_id|assigned|score|is_match|student_id|
    +----------+-------+------------+--------+-----+--------+----------+
    |1596048041|      1|          p1|       2|  0.7|   false|        s1|
    |1596048041|      2|          p1|       2|  0.5|    true|        s2|
    |1596048041|      3|          p1|       2|  0.3|   false|        s3|
    |1596048041|      4|          p1|       2|  0.2|   false|        s4|
    |1596048041|      1|          p2|       1|  0.9|    true|        s1|
    |1596048041|      2|          p2|       1|  0.1|   false|        s2|
    |1596048041|      3|          p2|       1| 0.15|   false|        s3|
    |1596048041|      4|          p2|       1|  0.2|   false|        s4|
    |1596048041|      1|          p3|       4|  0.2|   false|        s1|
    |1596048041|      2|          p3|       4|  0.3|   false|        s2|
    |1596048041|      3|          p3|       4|  0.4|   false|        s3|
    |1596048041|      4|          p3|       4|  0.8|    true|        s4|
    |1596048041|      1|          p4|       3|  0.2|   false|        s1|
    |1596048041|      2|          p4|       3|  0.3|   false|        s2|
    |1596048041|      3|          p4|       3| 0.35|    true|        s3|
    |1596048041|      4|          p4|       3|  0.4|   false|        s4|
    +----------+-------+------------+--------+-----+--------+----------+
    
    df_new = df_new.drop("student", "assigned")
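
    If you want the broadcast hint to be explicit (df3 is tiny, one row per student per timestamp), and to pull out only the final pairings, a minimal sketch:

    from pyspark.sql.functions import broadcast

    # Step-4 join with an explicit broadcast hint would look like:
    #   df_new = df_new.join(broadcast(df3), on=["time", "student"])

    # the optimal pairing per timestamp: one matched student per professor
    df_new.filter("is_match").select("time", "professor_id", "student_id", "score").show()
    # expected here: p2->s1, p1->s2, p4->s3, p3->s4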
    
