Definition/documentation for BigQuery schemas?

無奈伤痛 2020-12-22 00:08

Does anyone know where the documentation is for the definition of BigQuery schemas? In other words, the JSON schema you supply when uploading files, such as personsDataSchema.json.

1 answer
  • 2020-12-22 00:43

    To define a schema, you basically need three keys for each field: name, type and mode.

    Every field in your table must define these 3 keys. Suppose, for instance, you have a table like this:

    user_id    source
    1          search
    2          email
    

    Then the schema could be defined as:

    [{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
     {"name": "source", "type": "STRING", "mode": "NULLABLE"}]
    

    The key name just describes the field name, such as "user_id".

    The key type is the data type, such as STRING, INT64 or FLOAT64 (the legacy names INTEGER and FLOAT are accepted as well). Currently, BigQuery supports these types:

    • STRING
    • INT64
    • FLOAT64
    • BOOL
    • BYTES (encoding the bytes gives you their string representation)
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
    • RECORD

    If you open the documentation, you'll see that there is also an ARRAY data type. In a schema file an ARRAY is not written as a type; it is expressed as a field whose mode is REPEATED, as discussed further down.

    The third key, mode, can be one of these:

    • NULLABLE (allows values to be NULL)
    • REQUIRED (does not allow values to be NULL)
    • REPEATED (this is the ARRAY field, it means that the field is basically a list of values).
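    As a minimal sketch of the three keys described above, the snippet below builds schema entries in plain Python and serializes them to the JSON form used in a schema file. The helper make_field is hypothetical (it is not part of any BigQuery client), and the allowed types and modes are simply the ones listed in this answer.

```python
import json

# Type and mode names copied from the lists in this answer.
ALLOWED_TYPES = {"STRING", "INT64", "FLOAT64", "BOOL", "BYTES",
                 "DATE", "DATETIME", "TIME", "TIMESTAMP", "RECORD"}
ALLOWED_MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def make_field(name, field_type, mode="NULLABLE"):
    """Return one schema entry (hypothetical helper, not a BigQuery API)."""
    if field_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {field_type}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return {"name": name, "type": field_type, "mode": mode}


schema = [
    make_field("user_id", "INT64", "REQUIRED"),
    make_field("source", "STRING"),
]

# The text you would save as the schema file.
schema_json = json.dumps(schema, indent=1)
print(schema_json)
```

    Rejecting unknown type or mode names early is a design choice of this sketch; it catches typos before the schema reaches BigQuery.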

    So, let's take our previous example and add a repeated field (i.e., an ARRAY field) to illustrate:

    user_id    source    wishlist
    1          search    ["sku 0", "sku 1"]
    2          email     []
    3          direct    ["sku 0", "sku 3"]
    

    The schema could be defined as follows:

    [{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
     {"name": "source", "type": "STRING", "mode": "NULLABLE"},
     {"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]
    

    And there you have it, the ARRAY field defined as a repetition of string values.
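    To make the three modes concrete, here is a rough sketch that checks the example rows against the schema above. It only looks at the mode of each field (not its type), and mode_problems is a hypothetical helper written for this illustration, not something BigQuery provides.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
]


def mode_problems(row, schema):
    """List the ways a row (a dict) violates the modes in the schema."""
    problems = []
    for field in schema:
        name, mode = field["name"], field["mode"]
        value = row.get(name)  # a missing key is treated like NULL
        if mode == "REQUIRED" and value is None:
            problems.append(f"{name}: REQUIRED but missing or null")
        elif mode == "REPEATED" and value is not None and not isinstance(value, list):
            problems.append(f"{name}: REPEATED but not a list")
        elif mode == "NULLABLE" and isinstance(value, list):
            problems.append(f"{name}: NULLABLE but got a list")
    return problems


rows = [
    {"user_id": 1, "source": "search", "wishlist": ["sku 0", "sku 1"]},
    {"user_id": 2, "source": "email", "wishlist": []},
    {"user_id": 3, "source": "direct", "wishlist": ["sku 0", "sku 3"]},
]

for row in rows:
    print(row["user_id"], mode_problems(row, schema))
```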

    We are still left with one type of field: the RECORD field (STRUCT). These are defined in basically the same way, except that they also take a fourth key, fields. Since a RECORD contains other fields, you must describe those inner fields as well; this is easier to understand with an example:

    user_id    source    wishlist            location.country    location.city
    1          search    ["sku 0", "sku 1"]  USA                 NY
    2          email     []                  USA                 LA
    3          direct    ["sku 0", "sku 3"]  BR                  SP
    

    Here, location is a RECORD (STRUCT) with 2 keys inside: the country and the city. That's how you'd define a schema for them:

    [{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
     {"name": "source", "type": "STRING", "mode": "NULLABLE"},
     {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
     {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]
    

    Do you want a REPEATED field of RECORDs? Sure, why not! If, for instance, you want a REPEATED field holding each hit your client made on your website, you can define the schema like so:

    [{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
     {"name": "source", "type": "STRING", "mode": "NULLABLE"},
     {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
     {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
     {"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]
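    The dotted column names used in the table above (location.country, location.city) can be recovered from a nested schema by walking the fields key recursively. The following is a small sketch of that idea; column_paths is a hypothetical helper, not part of any BigQuery client.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "country", "type": "STRING", "mode": "NULLABLE"},
        {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
    {"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"},
        {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]},
]


def column_paths(schema, prefix=""):
    """Return the dotted path of every leaf (non-RECORD) field."""
    paths = []
    for field in schema:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            # Descend into the inner fields, extending the prefix.
            paths.extend(column_paths(field["fields"], path + "."))
        else:
            paths.append(path)
    return paths


print(column_paths(schema))
```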
    

    Given all that, we can finally answer your question: how would the schema in personsDataSchema.json be defined?

    This is an example of a row of personsData:

    {"kind": "person",
     "fullName": "John Doe",
     "age": 22,
     "gender": "Male",
     "phoneNumber": {"areaCode": "206", "number": "1234567"},
     "children": [{"name": "Jane", "gender": "Female", "age": "6"},
                  {"name": "John", "gender": "Male", "age": "15"}],
     "citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
                     {"place": "Stockholm", "yearsLived": ["2005"]}]}
    

    First, we have "kind": "person". This one is easy. Its mode can be either REQUIRED or NULLABLE; with NULLABLE, the schema would be:

    {"name": "kind", "type": "STRING", "mode": "NULLABLE"}
    

    phoneNumber is a RECORD (STRUCT) field with two inner fields, areaCode and number. Well, we already saw how to define those! The mode of phoneNumber itself can be NULLABLE or REQUIRED; NULLABLE is used here:

    {"name": "phoneNumber",
     "type": "RECORD",
     "mode": "NULLABLE",
     "fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
                {"name": "number", "type": "INT64", "mode": "NULLABLE"}]}
    

    Now, children and citiesLived follow the same pattern: both are a REPEATED (ARRAY) field of RECORDs (STRUCT). Just as in our last example, this one should be straightforward as well; citiesLived would be defined as:

    {"name": "citiesLived",
     "type": "RECORD",
     "mode": "REPEATED",
     "fields": [{"name": "place", "type": "STRING", "mode": "NULLABLE"},
                {"name": "yearsLived", "type": "INT64", "mode": "REPEATED"}]}
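    For the fields not spelled out here (fullName, age, gender, children), a first draft of the schema can be guessed from the sample row itself. The sketch below is a naive inference written for illustration only: infer_field is a hypothetical helper, it looks only at the first element of each list, it marks every non-list field as NULLABLE, and it treats quoted values such as "206" or "1995" as STRING.

```python
row = {
    "kind": "person",
    "fullName": "John Doe",
    "age": 22,
    "gender": "Male",
    "phoneNumber": {"areaCode": "206", "number": "1234567"},
    "children": [{"name": "Jane", "gender": "Female", "age": "6"},
                 {"name": "John", "gender": "Male", "age": "15"}],
    "citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
                    {"place": "Stockholm", "yearsLived": ["2005"]}],
}


def infer_field(name, value):
    """Guess one schema entry from a sample value (naive, illustrative)."""
    mode = "NULLABLE"
    if isinstance(value, list):
        mode = "REPEATED"
        value = value[0] if value else None  # inspect the first element only
    if isinstance(value, dict):
        return {"name": name, "type": "RECORD", "mode": mode,
                "fields": [infer_field(k, v) for k, v in value.items()]}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        field_type = "BOOL"
    elif isinstance(value, int):
        field_type = "INT64"
    elif isinstance(value, float):
        field_type = "FLOAT64"
    else:
        field_type = "STRING"  # strings, and anything not classified above
    return {"name": name, "type": field_type, "mode": mode}


schema = [infer_field(key, value) for key, value in row.items()]
for field in schema:
    print(field)
```

    Because the sample stores the phone number and the years as quoted strings, this draft marks them STRING, whereas the hand-written definitions above use INT64. Treat the output as a starting point and review the types and modes by hand.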
    

    And there you have it. That's basically all there is to defining a schema. If you are using Python, for instance, the idea is the same: you import the class SchemaField and use it to define each field, like so:

    from google.cloud.bigquery import SchemaField
    # The keyword for the data type is field_type, not type.
    field_kind = SchemaField(name="kind", field_type="STRING", mode="NULLABLE")
    

    Other clients will follow the same idea.

    So to summarize, you always have to define 3 keys for each field in your table: name, type and mode. If the field is of type RECORD, then you also have to define fields and for each inner field, you again define the 3 keys (4, if the inner field is of type RECORD again).
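    The rule in this summary is recursive, so it can be checked with a short recursive function. The sketch below reports every field that lacks one of the 3 keys, plus every RECORD that lacks fields; missing_keys is a hypothetical helper written for this answer, and it takes the "always define all 3 keys" rule at face value.

```python
REQUIRED_KEYS = ("name", "type", "mode")


def missing_keys(schema, prefix=""):
    """Return 'path: key' strings for every key missing from the schema."""
    missing = []
    for field in schema:
        path = prefix + field.get("name", "<unnamed>")
        for key in REQUIRED_KEYS:
            if key not in field:
                missing.append(f"{path}: {key}")
        if field.get("type") == "RECORD":
            if "fields" not in field:
                missing.append(f"{path}: fields")
            else:
                # Inner fields follow the same rule, so recurse.
                missing.extend(missing_keys(field["fields"], path + "."))
    return missing


incomplete_schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "country", "type": "STRING", "mode": "NULLABLE"},
        {"name": "city", "type": "STRING"}]},       # mode left out
    {"name": "hit", "type": "RECORD", "mode": "REPEATED"},  # fields left out
]

print(missing_keys(incomplete_schema))
```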

    Hopefully this makes it a bit clearer how to define a schema. Let me know if you still have any questions regarding this subject and I'll update the answer.
