Is there any option to preserve line breaks within quotation marks when reading multiline CSV files in Spark?

Submitted by 大城市里の小女人 on 2019-12-08 10:53:52

Question


I have a CSV file with a line break inside quotation marks on the third line (the first line is the CSV header).

data/testdata.csv

"id", "description"
"1", "some description"
"2", "other description with line
break"

Regardless of whether this is correct CSV or not, I must parse it into valid records. This is what I tried:

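import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Main2 {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[4]")
            .getOrCreate();
        // Read the CSV with a header row, using the default line-based record splitting.
        Dataset<Row> rows = spark
            .read()
            .format("csv")
            .option("header", "true")
            .load("data/testdata.csv");

        // The cast selects the Java overload of foreach, which can be ambiguous for a bare lambda.
        rows
            .foreach((ForeachFunction<Row>) row -> System.out.println(row));
    }
}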

The output looks like this:

[1, "some description"]
[2, "other description with line]
[break",null]

As you can see, Spark treats break" as a new record and fills the missing columns with null. The question is: is there an option for Spark's CSV parser that allows such line breaks?
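
For illustration only, the difference between line-based and quote-aware record splitting can be sketched in plain Java without Spark. This is not Spark's implementation; the class name QuoteAwareSplit and both helper methods are hypothetical and exist only to show why the quoted line break produces an extra record when every newline is treated as a record boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {

    // Naive approach: every newline ends a record, even inside quotes.
    static List<String> splitByLine(String csv) {
        List<String> records = new ArrayList<>();
        for (String line : csv.split("\n")) {
            records.add(line);
        }
        return records;
    }

    // Quote-aware approach: a newline ends a record only outside quotes.
    static List<String> splitQuoteAware(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : csv.toCharArray()) {
            if (c == '"') {
                // Toggle the state; a doubled quote ("") toggles twice and leaves it unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Same content as data/testdata.csv above.
        String csv = "\"id\", \"description\"\n"
            + "\"1\", \"some description\"\n"
            + "\"2\", \"other description with line\nbreak\"\n";
        System.out.println(splitByLine(csv).size());     // 4 (header plus three fragments)
        System.out.println(splitQuoteAware(csv).size()); // 3 (header plus two records)
    }
}
```

The line-based split yields the same stray break" fragment that shows up in the Spark output, while the quote-aware split keeps the second record in one piece.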

I also tried the code below (reference), but it doesn't work:

Dataset<Row> rows = spark.read()
    .option("parserLib", "univocity")
    .option("multiLine", "true")
    .csv("data/testdata.csv");

Answer 1:


According to this article, Spark has supported parsing multiline CSV files since version 2.2.0. In my case, these settings do the job:

sparkSession
    .read()
    .option("sep", ";")
    .option("quote", "\"")
    .option("multiLine", "true")
    .option("ignoreLeadingWhiteSpace", true)
    .csv(path.toString());
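
Adapted to the file from the question, the same options might look like the sketch below. It assumes Spark 2.2.0 or later and a comma-separated file, so "sep" is left at its default; the class name MultiLineCsvExample and the helper readMultiLineCsv are hypothetical. The "ignoreLeadingWhiteSpace" option is likely needed here because the sample file has a space after each comma, so without it the opening quote is not the first character of the field.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineCsvExample {

    // Reads a CSV file whose quoted fields may contain line breaks.
    static Dataset<Row> readMultiLineCsv(SparkSession spark, String path) {
        return spark.read()
            .option("header", "true")                  // first line holds column names
            .option("multiLine", "true")               // allow line breaks inside quoted fields
            .option("ignoreLeadingWhiteSpace", "true") // skip the space after each comma
            .csv(path);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[4]")
            .appName("multiline-csv-example")
            .getOrCreate();

        Dataset<Row> rows = readMultiLineCsv(spark, "data/testdata.csv");
        rows.show(false); // print rows without truncating long values

        spark.stop();
    }
}
```

With these options I would expect the line break to stay inside the description value of the second record instead of starting a new row, but this has not been verified against every Spark version.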


Source: https://stackoverflow.com/questions/53818894/is-there-any-option-to-preserve-line-breaks-within-quotation-marks-when-reading
