How do I add headers for the output csv for apache beam dataflow?

Submitted by 我与影子孤独终老i on 2021-02-08 03:33:46

Question


I noticed that the Java SDK has a function that lets you write a header line for a CSV file: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Write.html#withHeader-java.lang.String-

Is this feature mirrored in the Python SDK?


Answer 1:


You can now write to a text file and specify a header using the text sink.

From the documentation:

class apache_beam.io.textio.WriteToText(file_path_prefix, file_name_suffix='', append_trailing_newlines=True, num_shards=0, shard_name_template=None, coder=ToStringCoder, compression_type='auto', header=None)

So you can use the following piece of code:

beam.io.WriteToText(bucket_name, file_name_suffix='.csv', header='colname1, colname2')

The complete documentation is available here if you want more details or want to check how it is implemented: https://beam.apache.org/documentation/sdks/pydoc/2.0.0/_modules/apache_beam/io/textio.html#WriteToText
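One thing the one-liner above leaves open is how the rows themselves become CSV lines. WriteToText writes each element as one line of text, so the elements should already be formatted strings. Below is a minimal sketch, assuming the rows are dictionaries; the column names, the helper names to_csv_line and row_to_line, and the run function are all hypothetical and only for illustration. The Beam import sits inside run so the formatting helpers can be tried without Beam installed.

```python
import csv
import io

# Hypothetical column names, chosen only for this example.
COLUMNS = ["colname1", "colname2"]


def to_csv_line(values):
    """Format one row as a single CSV line without the trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # csv.writer appends "\r\n"; WriteToText adds its own newline per element.
    return buffer.getvalue().rstrip("\r\n")


# Building the header with the same helper keeps it consistent with the rows.
HEADER = to_csv_line(COLUMNS)


def row_to_line(row):
    """Turn a dict row into a CSV line, in the column order of the header."""
    return to_csv_line([row[name] for name in COLUMNS])


def run(output_prefix, rows):
    """Sketch of a pipeline; needs apache_beam installed to actually run."""
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(rows)
            | beam.Map(row_to_line)
            | beam.io.WriteToText(
                output_prefix, file_name_suffix=".csv", header=HEADER
            )
        )
```

Calling run('./output', [{'colname1': 'a,b', 'colname2': 3}]) should produce one or more .csv shards. As far as I can tell from the sink implementation, the header is written when each shard file is opened, so every shard starts with it; pass num_shards=1 if you need a single file with a single header.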




Answer 2:


This is not implemented at the moment. However, you can implement or extend it yourself (see the attached notebook for an example and test with my version of apache_beam).

This is based on a note in the docstring of the superclass FileSink, which mentions that you should override the open function.

The new class that works for my version of apache_beam ('0.3.0-incubating.dev'):

import apache_beam as beam
from apache_beam.io import TextFileSink
from apache_beam.io.fileio import ChannelFactory, CompressionTypes
from apache_beam import coders


class TextFileSinkWithHeader(TextFileSink):
    def __init__(self,
               file_path_prefix,
               file_name_suffix='',
               append_trailing_newlines=True,
               num_shards=0,
               shard_name_template=None,
               coder=coders.ToStringCoder(),
               compression_type=CompressionTypes.NO_COMPRESSION,
               header=None):
        super(TextFileSinkWithHeader, self).__init__(
            file_path_prefix,
            file_name_suffix=file_name_suffix,
            num_shards=num_shards,
            shard_name_template=shard_name_template,
            coder=coder,
            compression_type=compression_type,
            append_trailing_newlines=append_trailing_newlines)
        self.header = header

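    def open(self, temp_path):
        # Open the temporary shard file, then write the header (if any)
        # before any records are written to it.
        file_handle = ChannelFactory.open(
            temp_path,
            'wb',
            mime_type=self.mime_type)
        if self.header is not None:
            file_handle.write(self.header + "\n")
        return file_handle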

You can subsequently use it as follows:

beam.io.Write(TextFileSinkWithHeader('./names_w_headers', header="names"))

See the notebook for the complete overview.




Answer 3:


This feature does not yet exist in the Python SDK.



Source: https://stackoverflow.com/questions/39624809/how-do-i-add-headers-for-the-output-csv-for-apache-beam-dataflow
