I am trying to use data from a Spark dataframe as the input for my k-means model. However, I keep getting errors. (Check section after code)
My Spark dataframe looks like this:
You should maybe have continued on the same thread since it's the same problem. For reference: Preprocessing data in pyspark
Here you need to convert Latitude / Longitude to float and remove null values with dropna before feeding the data into KMeans, because it seems these columns contain some strings that cannot be cast to a numeric value. So preprocess df with something like:
from pyspark.sql.functions import col

df2 = (df
    .withColumn("Latitude", col("Latitude").cast("float"))    # cast to float; non-numeric strings become null
    .withColumn("Longitude", col("Longitude").cast("float"))
    .dropna()                                                  # drop rows where the cast produced null
)
spark_rdd = df2.rdd ...
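From there, a minimal sketch of the next step, assuming you are training the RDD-based pyspark.mllib KMeans (the k=3 and maxIterations values below are just placeholders):

from pyspark.mllib.clustering import KMeans

# build an RDD of [lat, lon] feature vectors from the cleaned dataframe
spark_rdd = df2.rdd.map(lambda row: [row["Latitude"], row["Longitude"]])

# train k-means on the numeric features
model = KMeans.train(spark_rdd, k=3, maxIterations=20)
print(model.clusterCenters)

If you are on the DataFrame-based pyspark.ml API instead, you would assemble the two columns into a "features" vector column with VectorAssembler and call KMeans(...).fit(df2) rather than converting to an RDD.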