How can I write a parquet file using Spark (pyspark)?


I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file in Spark, but I haven't had success yet. The documentation says that I can use


2 Answers

悲&欢浪女

2020-12-29 22:40
You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas runs on PySpark under the hood.

Here's the Koalas code:
```
import databricks.koalas as ks

df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
```
Read this blog post if you'd like more details.
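
For readers on Spark 3.2 or later: Koalas was merged into PySpark as the `pyspark.pandas` module, so the same conversion can be sketched without the separate `databricks.koalas` package. This is only a sketch under that version assumption; the function name `csv_to_parquet_pandas_api` is hypothetical and not part of any library.

```python
def csv_to_parquet_pandas_api(csv_path, parquet_path):
    """Convert a CSV file to Parquet with the pandas API on Spark.

    Hypothetical helper for illustration; assumes pyspark >= 3.2.
    """
    # Imported here so the module still loads where PySpark is missing;
    # the import only has to succeed when the function is called.
    import pyspark.pandas as ps

    psdf = ps.read_csv(csv_path)    # pandas-on-Spark DataFrame
    psdf.to_parquet(parquet_path)   # written by Spark as a directory of part files
    return psdf


# Usage, with the paths from the Koalas snippet above:
#   csv_to_parquet_pandas_api('/temp/proto_temp.csv', 'output/proto.parquet')
```

As with Koalas, the work is done by Spark, so `output/proto.parquet` ends up as a directory of part files rather than a single file.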
梦如初夏

2020-12-29 22:46
The error occurred because the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader (spark.read) to read the CSV file correctly before converting it to a Parquet file.
```
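from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession, the entry point for DataFrame APIs
spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read the CSV through the DataFrameReader, which returns a DataFrame
df = spark.read.csv("/temp/proto_temp.csv")

# Display the content of the DataFrame on stdout
df.show()

# Write the DataFrame out in Parquet format
df.write.parquet("output/proto.parquet")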
```
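
Two details of the snippet above are worth a hedged follow-up. With no options, `spark.read.csv` treats the first row as data and names the columns `_c0`, `_c1`, and so on, with every column read as a string. And `df.write.parquet` raises an error if the output path already exists, because that is the default save mode. The sketch below assumes the CSV has a header row, which the question does not state, and the helper name `csv_to_parquet` is hypothetical.

```python
def csv_to_parquet(spark, csv_path, parquet_path, has_header=True):
    """Read a CSV file into a DataFrame and write it out as Parquet.

    `spark` is an existing SparkSession. `has_header=True` is an
    assumption about the input file, not something the question states.
    """
    df = (
        spark.read
        .option("header", has_header)   # take column names from the first row
        .option("inferSchema", True)    # otherwise every column is a string
        .csv(csv_path)
    )
    # "overwrite" replaces existing output; the default mode raises an error
    df.write.mode("overwrite").parquet(parquet_path)
    return df


# Usage, with the session and paths from the answer above:
#   csv_to_parquet(spark, "/temp/proto_temp.csv", "output/proto.parquet")
#   spark.read.parquet("output/proto.parquet").printSchema()
```

Reading the output back with `spark.read.parquet` is a quick way to check that the column names and types made it into the Parquet files.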