问题
I use Spark 2.3.0.
As a Apache Spark's project I am using this data set to work on. When trying to read csv using spark, row in spark dataframe does not corresponds to correct row in csv (See sample csv here) file. Code looks like following:
answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True);
answer_df.show(2)
Output
+--------------------+-------------+--------------------+--------+-----+--------------------+
| Id| OwnerUserId| CreationDate|ParentId|Score| Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"| null| null| null| null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows
However, When i used pandas, it worked like a charm.
df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3)
Output
Index Id OwnerUserId CreationDate ParentId Score Body
0 92 61 2008-08-01T14:45:37Z 90 13 <p><a href="http://svnbook.red-bean.com/">Vers...
1 124 26 2008-08-01T16:09:47Z 80 12 <p>I wound up using this. It is a kind of a ha...
My Observation: Apache spark is treating every line in csv file as a record for dataframe( which is reasonable) but on the other hand, pandas intelligently ( not sure based on which parameters) figures out where the record end actually.
Question I would like to know, how can i instruct Spark to load the dataframe properly.
The data to be loaded is as follows with the lines starting with 92
and 124
being two records.
Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>
<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>
<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i < strSqlSplit.length; i++){
NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
回答1:
I think you should use option("escape", "\"")
as it seems that "
is used as so-called quote escape characters.
val q = spark.read
.option("multiLine", true)
.option("header", true)
.option("escape", "\"")
.csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01T16:09:47Z| 80| 12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
回答2:
After few hours of struggle I was able to figure out the solution.
Analysis:
Data dump provided by Stackoverflow
had quote(")
being escaped by a another quote(")
. And since spark uses slash(\)
as default value for escape character, which i was not passing, therefore it end up in giving meaningless output.
Updated code
answer_df = sparkSession.read.\
csv('./stacksample/Answers_sample.csv',
inferSchema=True, header=True, multiLine=True, escape='"');
answer_df.show(2)
Note the use of escape
parameter in csv()
.
Output
+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92| 61|2008-08-01 20:15:37| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01 21:39:47| 80| 12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+
Hope it will help other and save some time for them.
来源:https://stackoverflow.com/questions/50188579/how-to-load-csv-file-with-records-on-multiple-lines