Question
I have the following file, data.json, which I am trying to clean using PySpark.
{"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
{"positionmessage":{"callsign": , "name": , "mmsi": 200,"timestamplast": "2019-08-01T20:00:05Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": }}
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DateType, FloatType, TimestampType
import pyspark.sql.functions as f
appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
schema = StructType([
    StructField("positionmessage",
                StructType([
                    StructField('callsign', StringType(), True),
                    StructField('name', StringType(), True),
                    StructField('timestamplast', TimestampType(), True),
                    StructField('mmsi', IntegerType(), True)
                ]))])
file_name = "data.json"
df = spark.read.schema(schema).json(file_name).select("positionmessage.*")
df = df.withColumn("name", f.split(df["name"], "-")[1])  # strips the prefix "testschip-"
df.show()  # show() returns None, so it is called separately instead of being assigned to df
The timestamplast is not correct, due to the T between the date and the time. How do I fix this? Furthermore, I want to perform the following operations:
- Cast "name" to integer after "testschip-" has been removed.
- Drop duplicates: when timestamplast is the same for a certain name, the duplicate row should be removed from the dataframe.
- If there is a missing timestamp, then the whole row should be removed from the dataframe.
- Forward or backward fill missing numbers within the group "name". timestamplast should not be forward/backward filled (duplicates and missing numbers are already removed).
- Sort by timestamplast within the group "name" (timestamps must increase for a given name)
- Add a new column called "time_delta" that gives the time difference between successive "timestamplast" values within each group "name", relative to the previous record.
Source: https://stackoverflow.com/questions/61899539/pyspark-clean-data-within-dataframe