Working with a StructType column in PySpark UDF

Question

I have the following schema for one of the columns that I'm processing:

 |-- time_to_resolution_remainingTime: struct (nullable = true)
 |    |-- _links: struct (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |-- completedCycles: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- breached: boolean (nullable = true)
 |    |    |    |-- elapsedTime: struct (nullable = true)
 |    |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |    |-- millis: long (nullable = true)
 |    |    |    |-- goalDuration: struct (nullable = true)
 |    |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |    |-- millis: long (nullable = true)
 |    |    |    |-- remainingTime: struct (nullable = true)
 |    |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |    |-- millis: long (nullable = true)
 |    |    |    |-- startTime: struct (nullable = true)
 |    |    |    |    |-- epochMillis: long (nullable = true)
 |    |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |    |-- iso8601: string (nullable = true)
 |    |    |    |    |-- jira: string (nullable = true)
 |    |    |    |-- stopTime: struct (nullable = true)
 |    |    |    |    |-- epochMillis: long (nullable = true)
 |    |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |    |-- iso8601: string (nullable = true)
 |    |    |    |    |-- jira: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- ongoingCycle: struct (nullable = true)
 |    |    |-- breachTime: struct (nullable = true)
 |    |    |    |-- epochMillis: long (nullable = true)
 |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |-- iso8601: string (nullable = true)
 |    |    |    |-- jira: string (nullable = true)
 |    |    |-- breached: boolean (nullable = true)
 |    |    |-- elapsedTime: struct (nullable = true)
 |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |-- millis: long (nullable = true)
 |    |    |-- goalDuration: struct (nullable = true)
 |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |-- millis: long (nullable = true)
 |    |    |-- paused: boolean (nullable = true)
 |    |    |-- remainingTime: struct (nullable = true)
 |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |-- millis: long (nullable = true)
 |    |    |-- startTime: struct (nullable = true)
 |    |    |    |-- epochMillis: long (nullable = true)
 |    |    |    |-- friendly: string (nullable = true)
 |    |    |    |-- iso8601: string (nullable = true)
 |    |    |    |-- jira: string (nullable = true)
 |    |    |-- withinCalendarHours: boolean (nullable = true)

I'm interested in extracting the time fields (e.g., completedCycles[x].elapsedTime, ongoingCycle.remainingTime) based on certain conditions. The UDF I'm using is:

@udf("string")
def extract_time(s, field):
  # Return ongoing cycle field
  if has_column(s, 'ongoingCycle'):
    field = 'ongoingCycle.{}'.format(field)
    return s[field]
  
  # return last element of completed cycles
  s = s.get(size(s) - 1)
  return s[field]

cl = 'time_to_resolution_remainingTime'
df = df.withColumn(cl, extract_time(cl, lit("elapsedTime.friendly"))).select(cl)
display(df)

This results in an error:

SparkException: Job aborted due to stage failure: Task 0 in stage 549.0 failed 4 times, most recent failure: Lost task 0.3 in stage 549.0 (TID 1597, 10.155.239.76, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/types.py", line 1514, in __getitem__
    idx = self.__fields__.index(item)
ValueError: 'ongoingCycle.elapsedTime.friendly' is not in list

I'm obviously doing something wrong here, but I can't work out what. Is it possible to convert the struct value s inside the UDF to a Python dictionary and perform the calculations on that? Or is there a better way to do this?

Edit:

Sample Data

{
   "_links":{
      "self":"https:///...."
   },
   "completedCycles":[
      
   ],
   "id":"630",
   "name":"Time to resolution",
   "ongoingCycle":{
      "breachTime":{
         "epochMillis":1605583651354,
         "friendly":"17/Nov/20 3:27 PM +12:00",
         "iso8601":"2020-11-17T15:27:31+1200",
         "jira":"2020-11-17T15:27:31.354+1200"
      },
      "breached":true,
      "elapsedTime":{
         "friendly":"57h 32m",
         "millis":207148646
      },
      "goalDuration":{
         "friendly":"4h",
         "millis":14400000
      },
      "paused":false,
      "remainingTime":{
         "friendly":"-53h 32m",
         "millis":-192748646
      },
      "startTime":{
         "epochMillis":1605511651354,
         "friendly":"16/Nov/20 7:27 PM +12:00",
         "iso8601":"2020-11-16T19:27:31+1200",
         "jira":"2020-11-16T19:27:31.354+1200"
      },
      "withinCalendarHours":false
   }
}

Expected output: -53h 32m

With completed cycles but no ongoing cycle

{
   "_links":{
      "self":"https://...."
   },
   "completedCycles":[
      {
         "breached":true,
         "elapsedTime":{
            "friendly":"72h 43m",
            "millis":261818073
         },
         "goalDuration":{
            "friendly":"4h",
            "millis":14400000
         },
         "remainingTime":{
            "friendly":"-68h 43m",
            "millis":-247418073
         },
         "startTime":{
            "epochMillis":1605156449463,
            "friendly":"12/Nov/20 4:47 PM +12:00",
            "iso8601":"2020-11-12T16:47:29+1200",
            "jira":"2020-11-12T16:47:29.463+1200"
         },
         "stopTime":{
            "epochMillis":1606282267536,
            "friendly":"Today 5:31 PM +12:00",
            "iso8601":"2020-11-25T17:31:07+1200",
            "jira":"2020-11-25T17:31:07.536+1200"
         }
      }
   ],
   "id":"630",
   "name":"Time to resolution",
   "ongoingCycle": null
}

Expected output: -68h 43m

I got this code to work, but I'm not sure it's the best way to solve this:

@udf("string")
def extract_time(s, field):
  if s is None:
    return None
  
  # Return ongoing cycle field
  if has_column(s, 'ongoingCycle'):
    if s['ongoingCycle'] is not None:
      return s['ongoingCycle']['remainingTime']['friendly']
    
  # Get the last completed cycles' remaining time
  s_completed = s['completedCycles']
  if len(s_completed) > 0:
    return s_completed[-1]['remainingTime']['friendly']
  return None
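The UDF above hard-codes remainingTime.friendly and ignores its field argument. The original error arises because indexing a Row with a string looks up a single field name, so a dotted path such as 'ongoingCycle.elapsedTime.friendly' has to be walked one segment at a time. Below is a minimal sketch of that idea in plain Python. The helper name get_nested is hypothetical, and plain dicts stand in for the struct so the logic can be checked without a Spark session; inside a UDF, calling s.asDict(recursive=True) on the Row should yield a comparable nested dict.

```python
def get_nested(value, path):
    """Walk a dotted path such as 'remainingTime.friendly' one key at a time."""
    for key in path.split("."):
        if value is None:
            return None  # a null struct anywhere along the path yields None
        value = value[key]
    return value


def extract_time(record, field):
    """Prefer the ongoing cycle; otherwise use the last completed cycle."""
    if record is None:
        return None
    ongoing = record["ongoingCycle"]
    if ongoing is not None:
        return get_nested(ongoing, field)
    completed = record["completedCycles"]
    if completed:  # handles both None and an empty list
        return get_nested(completed[-1], field)
    return None


# Plain dicts standing in for the struct, trimmed from the sample data above.
ongoing_record = {
    "completedCycles": [],
    "ongoingCycle": {"remainingTime": {"friendly": "-53h 32m", "millis": -192748646}},
}
completed_record = {
    "completedCycles": [{"remainingTime": {"friendly": "-68h 43m", "millis": -247418073}}],
    "ongoingCycle": None,
}

print(extract_time(ongoing_record, "remainingTime.friendly"))    # -53h 32m
print(extract_time(completed_record, "remainingTime.friendly"))  # -68h 43m
```

The same extract_time body could be wrapped with @udf("string"), since both a Row and a dict support indexing by a single field name.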

Answer 1:

You can implement the same logic as your UDF without a UDF, using the when function or, as here, an equivalent SQL CASE WHEN expression passed to expr.

See the code below.

df.show()
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_links         |completedCycles                                                                                                                                                                                                                                                         |id |name              |ongoingCycle                                                                                                                                                                                                                                                                            |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[https:///....]|[]                                                                                                                                                                                                                                                                      |630|Time to resolution|[[1605583651354, 17/Nov/20 3:27 PM +12:00, 2020-11-17T15:27:31+1200, 2020-11-17T15:27:31.354+1200], true, [57h 32m, 207148646], [4h, 14400000], false, [-53h 32m, -192748646], [1605511651354, 16/Nov/20 7:27 PM +12:00, 2020-11-16T19:27:31+1200, 2020-11-16T19:27:31.354+1200], false]|
|[https://....] |[[true, [72h 43m, 261818073], [4h, 14400000], [-68h 43m, -247418073], [1605156449463, 12/Nov/20 4:47 PM +12:00, 2020-11-12T16:47:29+1200, 2020-11-12T16:47:29.463+1200], [1606282267536, Today 5:31 PM +12:00, 2020-11-25T17:31:07+1200, 2020-11-25T17:31:07.536+1200]]]|630|Time to resolution|null                                                                                                                                                                                                                                                                                    |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


from pyspark.sql import functions as F

# Prefer the ongoing cycle; otherwise fall back to the last completed cycle.
df.withColumn("time_to_resolution_remainingTime", F.expr("CASE WHEN ongoingCycle IS NOT NULL THEN ongoingCycle.remainingTime.friendly WHEN size(completedCycles) > 0 THEN completedCycles[size(completedCycles)-1].remainingTime.friendly ELSE null END"))\
.select("time_to_resolution_remainingTime")\
.show(truncate=False)

+--------------------------------+
|time_to_resolution_remainingTime|
+--------------------------------+
|-53h 32m                        |
|-68h 43m                        |
+--------------------------------+
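The expression above hard-codes the field paths, whereas the question passes the field as a parameter. One possible way to keep that flexibility is to build the CASE expression as a string and hand it to F.expr. The sketch below is only an illustration: latest_cycle_expr and its struct_col parameter are hypothetical names, and it assumes the same field path should be read from both the ongoing cycle and the last completed cycle.

```python
def latest_cycle_expr(field, struct_col=None):
    """Build a Spark SQL CASE expression that reads `field` from the ongoing
    cycle if present, otherwise from the last completed cycle."""
    # Optional prefix for when the cycles are nested inside a struct column.
    prefix = f"{struct_col}." if struct_col else ""
    ongoing = f"{prefix}ongoingCycle"
    completed = f"{prefix}completedCycles"
    return (
        f"CASE WHEN {ongoing} IS NOT NULL THEN {ongoing}.{field} "
        f"WHEN size({completed}) > 0 THEN {completed}[size({completed}) - 1].{field} "
        "ELSE NULL END"
    )


# Top-level columns, as in the DataFrame shown above.
print(latest_cycle_expr("remainingTime.friendly"))

# Cycles nested under the struct column from the question's schema.
print(latest_cycle_expr("elapsedTime.millis", "time_to_resolution_remainingTime"))
```

The resulting string would be used as df.withColumn("time_to_resolution_remainingTime", F.expr(latest_cycle_expr("remainingTime.friendly"))). Because the field path is interpolated into SQL text, it should come from trusted code rather than user input.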

Source: https://stackoverflow.com/questions/65007134/working-with-a-structtype-column-in-pyspark-udf

Tags

apache-spark

pyspark

user-defined-functions