Question
I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an output of
/tmp/output/foo=1/X.csv
/tmp/output/foo=2/Y.csv
...
However, the output CSV files only contain the value for bar, not foo. I know the value of foo is already captured in the directory name foo=N, but is it possible to also include the value of foo in the CSV file?
Answer 1:
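Only if you make a copy of the column under a different name and partition by the copy. partitionBy drops the partition column from the data files, so the copy (foo_) ends up in the directory name and the original foo stays in the CSV. The directories are then named foo_=N instead of foo=N:
from pyspark.sql.functions import col

# Partition by the copy (foo_); the original foo column is kept in the files.
(df
    .withColumn("foo_", col("foo"))
    .write.mode("overwrite")
    .format("csv").partitionBy("foo_").save("/tmp/output"))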
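If the files are read outside Spark and renaming the partition directories is not wanted, the value can also be recovered from the foo=N directory name mentioned in the question. The following is only a minimal sketch using the Python standard library. The helper names partition_values and read_with_partitions are hypothetical, and it assumes the CSV files were written without a header row (the Spark CSV writer's default unless the header option is set), so the column names are passed in by the caller.

```python
import csv
from pathlib import Path


def partition_values(path):
    """Collect key=value pairs from the directory components of a path."""
    values = {}
    for part in Path(path).parts:
        key, sep, value = part.partition("=")
        if sep and key:
            # Values come back as strings; no type inference is attempted.
            values[key] = value
    return values


def read_with_partitions(path, column_names):
    """Yield each CSV row as a dict, with the partition columns added.

    Assumes a headerless CSV file; column_names supplies the names
    of the columns stored in the file itself (here: just "bar").
    """
    # Only the parent directories are inspected, not the file name.
    extra = partition_values(Path(path).parent)
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            record = dict(zip(column_names, row))
            record.update(extra)
            yield record


print(partition_values("/tmp/output/foo=1/X.csv"))  # {'foo': '1'}
```

Calling read_with_partitions("/tmp/output/foo=1/X.csv", ["bar"]) would then yield rows that carry both bar and foo. Note that foo is returned as a string here.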
Source: https://stackoverflow.com/questions/48190107/spark-can-you-include-partition-columns-in-output-files