Google Cloud Dataflow - From PubSub to Parquet

Submitted by 房东的猫 on 2021-01-29 17:46:17

Question


I'm trying to write Google Pub/Sub messages to Google Cloud Storage using Google Cloud Dataflow. The Pub/Sub messages arrive in JSON format, and the only operation I want to perform is a conversion from JSON to Parquet files.

In the official documentation I found a Google-provided template that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage-avro). The problem is that the template source code is written in Java, while I would prefer to use the Python SDK.

These are my first tests with Dataflow and Beam in general, and there is not much material online to take hints from. Any suggestions, links, guidance, or code samples would be greatly appreciated.


Answer 1:


To further contribute to the community, I am summarising our discussion as an answer.

Since you are starting with Dataflow, I can point out some useful topics and advice:

  1. The built-in WriteToParquet() PTransform in Apache Beam is very useful: it writes a PCollection of records to Parquet files. To use it, you need to specify the schema, as indicated in the documentation. In addition, this article will help you better understand how to use the transform and how to write its output to a Google Cloud Storage (GCS) bucket.

  2. Google provides this code explaining how to read messages from Pub/Sub and write them to Google Cloud Storage. This quickstart reads the messages from Pub/Sub and writes the messages from each window to a bucket.

  3. Since you want to read from Pub/Sub, convert the messages to Parquet and store the files in a GCS bucket, I would advise structuring your pipeline as these steps: read the messages, convert them to Parquet, and store the resulting files in GCS.

I encourage you to read the links above. If you then have further questions, you can post a new question to get more specific help.



Source: https://stackoverflow.com/questions/63017832/google-cloud-dataflow-from-pubsub-to-parquet
