Bigtable-BigQuery Import via DataFlow: 2 questions on table partitioning and Timestamps

Submitted by 不羁岁月 on 2019-12-24 08:02:38

Question


I have a Dataflow job that imports data from Bigtable into BigQuery, using the built-in Dataflow APIs for both. I have two questions:

Question 1: If the source data lives in one large Bigtable table, how can I dynamically partition it into a set of smaller tables in BigQuery based on, say, a given Bigtable row key that is known only at run time?

The Java code in Dataflow looks like this:

p.apply(Read.from(CloudBigtableIO.read(config)))
        .apply(ParDo.of(new SomeDoFNonBTSourceData()))
        .apply(BigQueryIO.Write
                  .to(PROJ_ID + ":" + BQ_DataSet + "." + BQ_TableName)
                  .withSchema(schema)
                  .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                  .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
p.run();

So, since BQ_TableName has to be supplied when the pipeline is constructed, how can I provide it programmatically based on what is seen inside SomeDoFNonBTSourceData, such as a range of values of the current RowKey? If the RowKey is 'a-c', write to TableA; if 'd-f', TableB; and so on.
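Since the Dataflow SDK 1.x version of BigQueryIO.Write shown here takes a single table name at construction time, one common workaround is to split the PCollection with the Partition transform and attach one BigQueryIO.Write per output. A minimal sketch of that idea, assuming three hypothetical key ranges; KeyRouter and the table names are illustrative, not from the original post:

```java
// Route each element to a partition index by its row-key prefix, then
// attach one BigQueryIO.Write per partition. Only KeyRouter is plain Java;
// the pipeline wiring is shown in comments because it needs the Dataflow SDK.
public class KeyRouter {
    // Maps a Bigtable row key to a destination index:
    // 0 -> TableA (keys starting 'a'-'c'), 1 -> TableB ('d'-'f'),
    // 2 -> a catch-all table. The ranges are hypothetical.
    public static int tableIndexForKey(String rowKey) {
        if (rowKey == null || rowKey.isEmpty()) return 2;
        char c = Character.toLowerCase(rowKey.charAt(0));
        if (c >= 'a' && c <= 'c') return 0;
        if (c >= 'd' && c <= 'f') return 1;
        return 2;
    }
}

// Pipeline wiring (illustrative, Dataflow SDK 1.x style):
// PCollectionList<TableRow> parts = rows.apply(
//     Partition.of(3, (PartitionFn<TableRow>) (row, n) ->
//         KeyRouter.tableIndexForKey((String) row.get("RowKey"))));
// parts.get(0).apply(BigQueryIO.Write.to(PROJ_ID + ":" + BQ_DataSet + ".TableA") ...);
// parts.get(1).apply(BigQueryIO.Write.to(PROJ_ID + ":" + BQ_DataSet + ".TableB") ...);
// parts.get(2).apply(BigQueryIO.Write.to(PROJ_ID + ":" + BQ_DataSet + ".TableOther") ...);
```

This only works when the set of destination tables is known when the pipeline is built. For truly per-element destinations, the later Apache Beam 2.x BigQueryIO supports dynamic destinations (write().to(...) with a per-element function), which would solve this more directly if upgrading is an option.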

Question 2: What is the right way to export the Bigtable timestamp into BigQuery so that it can eventually be reconstructed in human-readable format in BigQuery?

The processElement function within the DoFn looks like this:

public void processElement(ProcessContext c)
{
    Cell cell = c.element().getColumnLatestCell(COL_FAM, COL_NAME);
    // Use CellUtil.cloneValue to extract just this cell's bytes;
    // getValueArray() alone returns the cell's entire backing array.
    String valA = new String(CellUtil.cloneValue(cell));
    Long timeStamp = cell.getTimestamp();

    TableRow tr = new TableRow();
    tr.put("ColA", valA);
    tr.put("TimeStamp", timeStamp);
    c.output(tr);
}

And during the Pipeline construction, the BQ schema setup for the timeStamp column looks like this:

List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("ColA").setType("STRING"));
fields.add(new TableFieldSchema().setName("TimeStamp").setType("TIMESTAMP"));
schema = new TableSchema().setFields(fields);

So the Bigtable timestamp appears to be a Java Long, and I have tried both "TIMESTAMP" and "INTEGER" for the destination TimeStamp column in BQ (there seems to be no Long type in BQ as such; INTEGER is the closest match). Ultimately, I need to use the TimeStamp column in BQ both in 'order by' clauses and to display the information in human-readable form (date and time). The 'order by' part seems to work OK, but I have not managed to CAST the end result into anything meaningful: I either get cast errors or something still unreadable.
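One way to sidestep the casting problem is to convert the timestamp to an ISO-8601 string inside the DoFn, since a BigQuery TIMESTAMP column accepts such strings directly on load. A minimal sketch, assuming the value returned by Cell.getTimestamp() through the HBase API is milliseconds since the Unix epoch (the helper name is illustrative):

```java
import java.time.Instant;

public class TimestampConv {
    // Converts an epoch-milliseconds value (the assumed unit of
    // Cell.getTimestamp() via the HBase client) into an ISO-8601 UTC
    // string, which BigQuery's TIMESTAMP type accepts on load.
    public static String toBigQueryTimestamp(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }
}
```

In the DoFn this would replace the raw Long: tr.put("TimeStamp", TimestampConv.toBigQueryTimestamp(timeStamp)); — and the column then displays as a readable date/time in BigQuery while still sorting correctly.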


Answer 1:


Incidentally, I am here looking for an answer to an issue similar to Question 1 :).

For the second question, I think you first need to confirm that the Long timestamp is indeed a Unix timestamp; I've always assumed BQ can ingest that as a TIMESTAMP without any conversion.

But you can try this...

// Example epoch value in seconds; multiply by 1000 because
// Date.setTime() expects milliseconds.
Long longTimeStamp = 1408452095L;

Date timeStamp = new Date();
timeStamp.setTime(longTimeStamp * 1000);

// Date.toInstant() (Java 8+) renders an ISO-8601 UTC string,
// e.g. "2014-08-19T12:41:35Z", which BigQuery's TIMESTAMP accepts.
tr.put("TimeStamp", timeStamp.toInstant().toString());


Source: https://stackoverflow.com/questions/41698754/bigtable-bigquery-import-via-dataflow-2-questions-on-table-partitioning-and-tim
