Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

死守一世寂寞 2020-11-29 12:25

I wanted to take advantage of the new BigQuery functionality for time-partitioned tables, but I'm unsure whether this is currently possible in version 1.6 of the Dataflow SDK.

6 Answers
  • 2020-11-29 12:47

    I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic: if the target partition already has data, I can either append to it or overwrite it.

    I wrote the code in Python. It is a batch-mode write operation into BigQuery.

    from google.cloud import bigquery

    client = bigquery.Client(project=projectName)
    dataset_ref = client.dataset(datasetName)
    table_ref = dataset_ref.table(bqTableName)
    job_config = bigquery.LoadJobConfig()
    job_config.skip_leading_rows = skipLeadingRows
    job_config.source_format = bigquery.SourceFormat.CSV
    if tableExists(client, table_ref):  # user-defined existence check
        job_config.autodetect = autoDetect
        previous_rows = client.get_table(table_ref).num_rows
        # assert previous_rows > 0
        if allowJaggedRows is True:
            job_config.allow_jagged_rows = True
        if allowFieldAddition is True:
            job_config.schema_update_options = [
                bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
            ]
        if isPartitioned is True:
            job_config.time_partitioning = bigquery.TimePartitioning(
                type_=bigquery.TimePartitioningType.DAY
            )
        if schemaList is not None:
            job_config.schema = schemaList
        job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    else:
        job_config.autodetect = autoDetect
        job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
        if isPartitioned is True:
            job_config.time_partitioning = bigquery.TimePartitioning(
                type_=bigquery.TimePartitioningType.DAY
            )
        if schemaList is not None:
            job_config.schema = schemaList
    load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
    assert load_job.job_type == 'load'
    load_job.result()  # block until the load job completes
    assert load_job.state == 'DONE'
    

    It works fine.

  • 2020-11-29 12:49

    The approach I took (it works in streaming mode, too):

    • Define a custom window for the incoming record
    • Convert the window into the table/partition name

      p.apply(PubsubIO.Read
                  .subscription(subscription)
                  .withCoder(TableRowJsonCoder.of())
              )
              .apply(Window.into(new TablePartitionWindowFn()) )
              .apply(BigQueryIO.Write
                             .to(new DayPartitionFunc(dataset, table))
                             .withSchema(schema)
                             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              );
      

    With the window set from the incoming data, the end instant can effectively be ignored, since only the start value is used to determine the partition:

    public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

        private IntervalWindow assignWindow(AssignContext context) {
            TableRow source = (TableRow) context.element();
            String dttmStr = (String) source.get("DTTM");

            DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();

            // Only the window's start instant matters for partitioning; the end
            // is an arbitrary offset of 1000 ms * 1 = one second later.
            Instant startPoint = Instant.parse(dttmStr, formatter);
            Instant endPoint = startPoint.withDurationAdded(1000, 1);

            return new IntervalWindow(startPoint, endPoint);
        }

        @Override
        public Coder<IntervalWindow> windowCoder() {
            return IntervalWindow.getCoder();
        }

        @Override
        public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
            return Arrays.asList(assignWindow(c));
        }

        @Override
        public boolean isCompatible(WindowFn<?, ?> other) {
            return false;
        }

        @Override
        public IntervalWindow getSideInputWindow(BoundedWindow window) {
            if (window instanceof GlobalWindow) {
                throw new IllegalArgumentException(
                        "Attempted to get side input window for GlobalWindow from non-global WindowFn");
            }
            return null; // side inputs are not used with this WindowFn
        }
    }

    Setting the table partition dynamically:

    public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {

        private String destination = "";

        public DayPartitionFunc(String dataset, String table) {
            this.destination = dataset + "." + table + "$";
        }

        @Override
        public String apply(BoundedWindow boundedWindow) {
            // The cast below is safe because TablePartitionWindowFn above always
            // assigns IntervalWindows.
            String dayString = DateTimeFormat.forPattern("yyyyMMdd")
                                             .withZone(DateTimeZone.UTC)
                                             .print(((IntervalWindow) boundedWindow).start());
            return destination + dayString;
        }
    }
    

    Is there a better way of achieving the same outcome?

  • 2020-11-29 12:54

    I believe it should be possible to use the partition decorator when you are not using streaming. We are actively working on supporting partition decorators through streaming. Please let us know if you are seeing any errors today with non-streaming mode.
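
    For reference, below is a minimal batch-mode sketch against the Dataflow 1.x Java SDK; the project, dataset, table, and schema are hypothetical, and the target table is assumed to be pre-created with day partitioning so that the "$20160101" decorator addresses one of its partitions:

    import java.util.Arrays;

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Create;

    public class PartitionDecoratorBatchWrite {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            TableSchema schema = new TableSchema().setFields(Arrays.asList(
                    new TableFieldSchema().setName("DTTM").setType("TIMESTAMP"),
                    new TableFieldSchema().setName("value").setType("STRING")));

            TableRow row = new TableRow()
                    .set("DTTM", "2016-01-01 00:00:00")
                    .set("value", "example");

            p.apply(Create.of(row).withCoder(TableRowJsonCoder.of()))
             // "$20160101" is the partition decorator; CREATE_NEVER because the
             // partitioned table must already exist.
             .apply(BigQueryIO.Write
                     .to("my-project:my_dataset.my_table$20160101")
                     .withSchema(schema)
                     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

            p.run();
        }
    }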

  • 2020-11-29 12:57

    As Pavan says, it is definitely possible to write to partitioned tables with Dataflow. Are you using the DataflowPipelineRunner in streaming mode or batch mode?

    The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:

    /**
     * A Joda-time formatter that prints a date in format like {@code "20160101"}.
     * Threadsafe.
     */
    private static final DateTimeFormatter FORMATTER =
        DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);
    
    // This code generates a valid BigQuery partition name:
    Instant instant = Instant.now(); // any Joda instant in a reasonable time range
    String baseTableName = "project:dataset.table"; // a valid BigQuery table name
    String partitionName =
        String.format("%s$%s", baseTableName, FORMATTER.print(instant));
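
    A hedged sketch of wiring that name into the sink (rows and schema are assumed to exist elsewhere in the pipeline):

    // Hypothetical: rows is a PCollection<TableRow>, schema a TableSchema.
    rows.apply(BigQueryIO.Write
            .to(partitionName) // e.g. "project:dataset.table$20160101"
            .withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));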
    
  • 2020-11-29 13:02

    If you pass the table name in table_name_YYYYMMDD format, BigQuery will treat it as a sharded table, which can simulate some of the features of partitioned tables. See the documentation: https://cloud.google.com/bigquery/docs/partitioned-tables
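
    To route windows to such shards from Dataflow, one option (a sketch with hypothetical names, reusing the SerializableFunction approach from the earlier answer) is to build the name with an underscore suffix instead of a "$" decorator:

    // Emits names like "my_dataset.my_table_20160101": a separate sharded
    // table per day rather than a partition of a single table.
    public class DayShardFunc implements SerializableFunction<BoundedWindow, String> {

        private final String base;

        public DayShardFunc(String dataset, String table) {
            this.base = dataset + "." + table + "_";
        }

        @Override
        public String apply(BoundedWindow window) {
            return base + DateTimeFormat.forPattern("yyyyMMdd")
                                        .withZone(DateTimeZone.UTC)
                                        .print(((IntervalWindow) window).start());
        }
    }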

  • 2020-11-29 13:08

    Apache Beam version 2.0 supports sharding BigQuery output tables out of the box.
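
    For example, the sink can compute a TableDestination per element. A minimal sketch, assuming rows is a windowed PCollection<TableRow> and the project, dataset, table, and schema names are hypothetical:

    rows.apply(BigQueryIO.writeTableRows()
            .to((ValueInSingleWindow<TableRow> input) -> {
                // Derive the day partition from the element's window start,
                // mirroring the custom-window approach above.
                String day = DateTimeFormat.forPattern("yyyyMMdd")
                        .withZoneUTC()
                        .print(((IntervalWindow) input.getWindow()).start());
                return new TableDestination(
                        "my-project:my_dataset.my_table$" + day, null);
            })
            .withSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));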
