I would like to process a huge order CSV file (5GB), with some metadata rows at the start of file. Header columns are represented in row 4 (starting with "h,") followed by another metadata row, describing optionality. Data rows start with "d,"
m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100
Is it possible to skip a specified number of rows when loading the file and use 'inferSchema' option for DataSet?
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("\home\user\data\20170326.csv");
Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?
You can try setting the "comment"
option to "m"
, effectively telling the csv reader to skip lines beginning with the "m"
character.
df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("comment", "m")
.load("\home\user\data\20170326.csv")
来源:https://stackoverflow.com/questions/43029020/apache-spark-dataframe-load-data-from-nth-line-of-a-csv-file