Question
I am trying to read a comma-delimited csv file using pyspark version 2.4.5 and Databricks' spark-csv module. One of the fields in the csv file has a JSON object as its value. The contents of the csv file are shown below
test.csv
header_col_1, header_col_2, header_col_3
one, two, three
one, {“key1”:“value1",“key2”:“value2",“key3”:“value3”,“key4”:“value4"}, three
Other solutions that I found defined the read options as "escape": '"' and 'delimiter': ",". This does not seem to work here because the commas in the field in question are not enclosed in double quotes. Below is the source code that I am using to read the csv file
test.py
import findspark
findspark.init()  # make pyspark importable before it is imported

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

read_options = {
    'header': 'true',
    'escape': '"',
    'delimiter': ',',
    'inferSchema': 'false',
}

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('test.csv')
spark_df.show()
Output of the above program is as shown below
+------------+-----------------+---------------+
|header_col_1| header_col_2| header_col_3|
+------------+-----------------+---------------+
| one| two| three|
| one| {“key1”:“value1"|“key2”:“value2"|
+------------+-----------------+---------------+
Answer 1:
In the CSV file, you have to put the JSON string in straight double quotes, and the double quotes inside the JSON string must be escaped with backslashes (\"). Remove your escape option, as it is incorrect: by default the delimiter is already ",", the escape character is '\' and the quote character is '"'. Refer to the Databricks documentation.
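A minimal sketch of what that could look like (the file name test_quoted.csv and the rewritten rows are assumptions for illustration, not from the original post): the JSON field is wrapped in straight double quotes and its inner quotes are backslash-escaped, so the default quote and escape characters keep the commas inside one column.
test_quoted.csv
header_col_1,header_col_2,header_col_3
one,two,three
one,"{\"key1\":\"value1\",\"key2\":\"value2\",\"key3\":\"value3\",\"key4\":\"value4\"}",three
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

# No escape option: the defaults (quote '"', escape '\') already handle the quoted JSON field
spark_df = spark.read.format('com.databricks.spark.csv').options(header='true', inferSchema='false').load('test_quoted.csv')
spark_df.show(truncate=False)  # header_col_2 comes back as a single column holding the full JSON string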
Answer 2:
Delimiters between double quotes are ignored by default.
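For example, a comma inside a double-quoted field is kept in a single column. Below is a minimal runnable sketch; the file quoted.csv and its contents are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

# Write a tiny csv whose second field contains a comma but is wrapped in double quotes
with open('quoted.csv', 'w') as f:
    f.write('header_col_1,header_col_2,header_col_3\n')
    f.write('one,"two, still two",three\n')

df = spark.read.format('com.databricks.spark.csv').options(header='true').load('quoted.csv')
df.show(truncate=False)  # header_col_2 is the single value "two, still two"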
The solution to the issue is not so elegant and I guess it can be improved. What worked for me was a two-step process. The first step was reading the file as text using pyspark's spark.read.text() method. The second step involved manipulating the JSON object: replacing any double quotes inside the object with single quotes, wrapping the whole object in double quotes, and then writing the contents to a new csv file, which I then read using the spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv') method.
Below is the code snippet for the program
from pyspark.sql import SparkSession

read_options = {
    'header': 'true',
    'escape': '"',
    'delimiter': ',',
    'inferSchema': 'false',
}

spark = SparkSession.builder.appName('test').getOrCreate()
sc = spark.sparkContext

# Step 1: read the file as plain text lines
lines = sc.textFile("test.csv").collect()

# Step 2: replace the double quotes inside the JSON object with single quotes,
# wrap the whole object in double quotes, and write out a new csv file
new_data = [
    line.replace(' ', '')
        .replace('“', "'").replace('”', "'").replace('"', "'")
        .replace('{', '"{').replace('}', '}"') + '\n'
    for line in lines
]

with open('new.csv', 'w') as new_file:
    new_file.writelines(new_data)

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('new.csv')
spark_df.show(3, False)
The above program produces the output below
+------------+-----------------------------------------------------------------+------------+
|header_col_1|header_col_2 |header_col_3|
+------------+-----------------------------------------------------------------+------------+
|one |two |three |
|one |{'key1':'value1','key2':'value2','key3':'value3','key4':'value4'}|three |
+------------+-----------------------------------------------------------------+------------+
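If the JSON values are needed as structured data afterwards, the resulting string column could be parsed with from_json. This is only a follow-up sketch (the column name header_col_2_parsed is made up), relying on Spark's JSON parser accepting single quotes by default (allowSingleQuotes).
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# Non-JSON values such as "two" simply become null in the parsed column
parsed = spark_df.withColumn('header_col_2_parsed', F.from_json('header_col_2', MapType(StringType(), StringType())))
parsed.show(truncate=False)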
Source: https://stackoverflow.com/questions/63042848/how-do-i-prevent-pyspark-from-interpreting-commas-as-a-delimiter-in-a-csv-field