PySpark: How to specify column with comma as decimal

Submitted by 核能气质少年 on 2020-02-28 03:05:14

Question


I am working with PySpark and loading a csv file. I have a column with numbers in European format, which means that comma replaces the dot and vice versa.

For example: I have 2.416,67 instead of 2,416.67.

My data in the .csv file looks like this:
ID;    Revenue
21;    2.645,45
23;   31.147,05
.
.
55;    1.009,11

In pandas, such a file can easily be read by specifying the decimal=',' and thousands='.' options inside pd.read_csv().

Pandas code:

import pandas as pd
df = pd.read_csv("filepath/revenues.csv", sep=';', decimal=',', thousands='.')

I don't know how this can be done in PySpark.

PySpark code:

from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
            StructField("ID", StringType(), True),
            StructField("Revenue", FloatType(), True)
                    ])
df = spark.read.csv("filepath/revenues.csv", sep=';', encoding='UTF-8', schema=schema, header=True)

Can anyone suggest how to load such a file in PySpark using the above-mentioned .csv() function?


Answer 1:


You won't be able to read it as a float because of the format of the data. You need to read it as a string, clean it up, and then cast it to float:

from pyspark.sql.functions import regexp_replace, col

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))  # drop the thousands separator: 2.645,45 -> 2645,45
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))   # decimal comma to dot: 2645,45 -> 2645.45
df = df.withColumn('revenue', df['revenue'].cast("float"))
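
Not part of the original answer, but as a quick sanity check after the cast, printing the schema and a few rows should show revenue as a float:

df.printSchema()   # revenue should now be float
df.show(5)         # e.g. 2.645,45 now appears as 2645.45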

You can probably just chain these all together too:

df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
         df
         .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
         .withColumn('revenue', regexp_replace('revenue', ',', '.'))
         .withColumn('revenue', df['revenue'].cast("float"))
     )

Please note I haven't tested this, so there may be a typo or two in there.
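
Tying this back to the explicit schema from the question, an untested sketch is to declare Revenue as a string in the schema, clean it up, and cast at the end (file path and column names taken from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Read Revenue as a string first; a FloatType field would come back as null
# because "2.645,45" cannot be parsed as a float.
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Revenue", StringType(), True),
])

df = spark.read.csv("filepath/revenues.csv", sep=';', header=True, schema=schema)

df = (
    df
    .withColumn('Revenue', F.regexp_replace('Revenue', '\\.', ''))  # 2.645,45 -> 2645,45
    .withColumn('Revenue', F.regexp_replace('Revenue', ',', '.'))   # 2645,45  -> 2645.45
    .withColumn('Revenue', F.col('Revenue').cast("float"))
)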




Answer 2:


Make sure your SQL table is pre-formatted to hold NUMERIC instead of INTEGER. I had a lot of trouble trying to figure out everything about encoding and the different formats of dots and commas, and in the end the problem was much more primitive: the table was pre-formatted to accept only INTEGER numbers, so no decimals would ever be accepted, regardless of whether they used commas or dots. I just had to change my SQL table to accept real numbers (NUMERIC) instead, and that was it.
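
If the table is created by Spark's JDBC writer in the first place, the column type can be forced at write time with the createTableColumnTypes option. A hedged sketch, with a placeholder PostgreSQL URL, table name, and credentials:

# Hypothetical connection details; adjust url, table, and credentials to your setup.
(
    df.write
    .option("createTableColumnTypes", "Revenue NUMERIC(12,2)")  # NUMERIC instead of INTEGER
    .jdbc(
        url="jdbc:postgresql://localhost:5432/mydb",
        table="revenues",
        mode="overwrite",
        properties={"user": "username", "password": "password"},
    )
)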



Source: https://stackoverflow.com/questions/52702608/pyspark-how-to-specify-column-with-comma-as-decimal
