PySpark: clean data within a DataFrame

Question

I have the following file, data.json, which I am trying to clean using PySpark.

{"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
{"positionmessage":{"callsign":       , "name":               , "mmsi": 200,"timestamplast": "2019-08-01T20:00:05Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast":                       }}

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, TimestampType
import pyspark.sql.functions as f

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Every record is nested under a single "positionmessage" struct
schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('timestamplast', TimestampType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

file_name = "data.json"
# Flatten the nested struct into top-level columns
df = spark.read.schema(schema).json(file_name).select("positionmessage.*")
# Strip the prefix "testschip-" by keeping the part after the hyphen
df = df.withColumn("name", f.split(df['name'], '-')[1])
# show() returns None, so call it separately instead of assigning its result to df
df.show()

The timestamplast column is not parsed correctly because of the T between the date and the time. How do I fix this? Furthermore, I want to do the following operations:

1) Specify the dtype of "name" as integer after "testschip-" has been removed.
2) Drop duplicates: when timestamplast is the same for a certain name, the duplicate row should be removed from the dataframe.
3) If the timestamp is missing, the whole row should be removed from the dataframe.
4) Forward or backward fill missing numbers within the group "name". timestamplast should not be forward/backward filled (rows with duplicate or missing timestamps have already been removed by then).
5) Sort by timestamplast within the group "name" (timestamps must increase for a given name).
6) Add a new column called "time_delta" which gives the time difference between successive "timestamplast" values within each group "name", with respect to the previous record.

Source: https://stackoverflow.com/questions/61899539/pyspark-clean-data-within-dataframe

Tags

pyspark

data-cleaning