Does anyone know where the documentation is for the definition of BigQuery schemas? In other words, the JSON schema you supply when uploading files - personsDataSchema.json
To define a schema, all you basically need is to define 3 keys for each field: name, type and mode.
Each field in your table must define these 3 keys. For instance, if you have a table like:
user_id source
1 search
2 email
Then the schema could be defined as:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"}]
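If you build the schema in Python, a minimal sketch of writing it to a schema file and reading it back could look like this. The file name example_schema.json and the helper names write_schema and read_schema are just assumptions for illustration, not part of any BigQuery tool.

```python
import json
import os
import tempfile

# Schema from the example above: one dict per column.
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
]


def write_schema(fields, path):
    """Write the list of field definitions to `path` as a JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(fields, fh, indent=2)


def read_schema(path):
    """Read a schema file back into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical file name, written to a temporary directory.
schema_path = os.path.join(tempfile.mkdtemp(), "example_schema.json")
write_schema(schema, schema_path)
loaded = read_schema(schema_path)
```

The file is nothing more than a JSON array with one object per field, which is why a plain json.dump is enough here.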
The key name just describes the field name, such as "user_id".
The key type is the data type, such as STRING, INTEGER, FLOAT and so on. The BigQuery documentation lists all the types that are currently supported.
If you open that documentation, you'll see that there is also the data type ARRAY, which is defined as a REPEATED field. I'll discuss those later.
The third key, mode, can be one of these:
NULLABLE (the field allows NULL)
REQUIRED (the field does not allow NULL)
REPEATED (the field is an ARRAY)
So, let's take our previous example and add a repeated field (i.e., an ARRAY field) to illustrate:
user_id source wishlist
1 search ["sku 0", "sku 1"]
2 email []
3 direct ["sku 0", "sku 3"]
The schema could be defined as follows:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]
And there you have it, the ARRAY field defined as a repetition of string values.
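To make the three modes a bit more concrete, here is a rough sketch that compares a row (as a Python dict) against the schema above. The function check_row is hypothetical and only looks at mode, not at type; it is a simplification and does not reproduce BigQuery's own validation.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
]


def check_row(row, schema):
    """Return a list of mode-related problems found in `row`."""
    problems = []
    for field in schema:
        name, mode = field["name"], field["mode"]
        value = row.get(name)  # a missing key is treated like NULL
        if mode == "REQUIRED" and value is None:
            problems.append(f"{name}: REQUIRED field is missing or null")
        elif mode == "REPEATED" and value is not None and not isinstance(value, list):
            problems.append(f"{name}: REPEATED field must be a list")
        elif mode == "NULLABLE" and isinstance(value, list):
            problems.append(f"{name}: NULLABLE field must hold a single value")
    return problems


ok_row = {"user_id": 1, "source": "search", "wishlist": ["sku 0", "sku 1"]}
missing_id = {"source": "email", "wishlist": []}
bad_wishlist = {"user_id": 3, "source": "direct", "wishlist": "sku 0"}
```

Calling check_row(ok_row, schema) gives an empty list, while the other two rows each produce one problem message.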
We are still left with one type of field, the RECORD field (STRUCT). These are defined in basically the same way, except that they also need a fourth key, fields. As RECORDs contain other fields, you must describe those as well; this is easier to understand with an example:
user_id source wishlist location.country location.city
1 search ["sku 0", "sku 1"] USA NY
2 email [] USA LA
3 direct ["sku 0", "sku 3"] BR SP
Here, location is a RECORD (STRUCT) with 2 keys inside: country and city. This is how you'd define a schema for it:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]
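One way to see how the nested fields key maps back to the dotted column names in the table (location.country, location.city) is to walk the schema recursively. This is only a sketch; flatten_field_names is a made-up helper, not something provided by BigQuery.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE",
     "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"},
                {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
]


def flatten_field_names(fields, prefix=""):
    """Return dotted names of all leaf fields, descending into RECORDs."""
    names = []
    for field in fields:
        full_name = prefix + field["name"]
        if field["type"] == "RECORD":
            # A RECORD has no value of its own; list its inner fields instead.
            names.extend(flatten_field_names(field["fields"], full_name + "."))
        else:
            names.append(full_name)
    return names


column_names = flatten_field_names(schema)
```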
Do you want a REPEATED field of RECORDs? Sure, why not! For instance, if you want a REPEATED field holding each hit your client had on your website, you can define the schema like so:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
{"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]
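For reference, this is roughly what a row matching that schema could look like, assuming the data is loaded as newline-delimited JSON (one JSON object per line). The hitNumber and hitPage values are invented for the illustration.

```python
import json

# One row for the schema above; "hit" is a list of dicts because it is
# a REPEATED field of RECORDs, "location" is a single dict (one RECORD).
row = {
    "user_id": 1,
    "source": "search",
    "wishlist": ["sku 0", "sku 1"],
    "location": {"country": "USA", "city": "NY"},
    "hit": [
        {"hitNumber": 1, "hitPage": "/home"},  # hypothetical values
        {"hitNumber": 2, "hitPage": "/cart"},  # hypothetical values
    ],
}

# json.dumps without indent keeps the whole object on a single line.
line = json.dumps(row)
```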
Given all that, we can finally answer your question: how would the personsDataSchema.json schema be defined?
This is an example of a row of personsData:
{"kind": "person",
"fullName": "John Doe",
"age": 22,
"gender": "Male",
"phoneNumber": {"areaCode": "206", "number": "1234567"},
"children": [{"name": "Jane", "gender": "Female", "age": "6"},
{"name": "John", "gender": "Male", "age": "15"}],
"citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
{"place": "Stockholm", "yearsLived": ["2005"]}]}
First, we have "kind": "person". This is easy; its schema would be:
{"name": "kind", "type": "STRING", "mode": "REQUIRED" or "NULLABLE"}
phoneNumber is a RECORD (STRUCT) field with two inner fields, areaCode and number. Well, we already saw how to define those!
{"name": "phoneNumber",
"type": "RECORD",
"mode": "NULLABLE OR REQUIRED",
"fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
{"name": "number", "type": "INT64", "mode": "NULLABLE"}]}
Now children and citiesLived have the same kind of definition, that is, they are both REPEATED (ARRAY) fields of RECORDs (STRUCT). Just as in our last example, this one is straightforward as well; citiesLived would be defined as:
{"name": "citiesLived",
"type": "RECORD",
"mode": "REPEATED",
"fields": [{"name": "place", "type": "STRING", "mode": "NULLABLE"},
{"name": "yearsLived", "type": "INT64", "mode": "REPEATED"}]}
And there you have it. That's basically all there is to schema definition. If you are using Python, for instance, the idea is the same. You import the class SchemaField to define each field, like so:
from google.cloud.bigquery import SchemaField
# SchemaField takes the data type through the field_type parameter.
field_kind = SchemaField(name="kind", field_type="STRING", mode="NULLABLE")
Other clients will follow the same idea.
So to summarize, you always have to define 3 keys for each field in your table: name, type and mode. If the field is of type RECORD, then you also have to define fields, and for each inner field you again define the 3 keys (4, if the inner field is itself of type RECORD).
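The summary above translates fairly directly into a small recursive check. The sketch below follows the rule of thumb from this answer (always 3 keys, plus fields for RECORDs); validate_schema is a hypothetical helper, and BigQuery itself may be more lenient than this.

```python
REQUIRED_KEYS = ("name", "type", "mode")
MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def validate_schema(fields, path=""):
    """Return a list of error messages for a list of field definitions."""
    errors = []
    for field in fields:
        label = path + str(field.get("name", "<unnamed>"))
        # Every field needs the 3 basic keys.
        for key in REQUIRED_KEYS:
            if key not in field:
                errors.append(f"{label}: missing key '{key}'")
        if "mode" in field and field["mode"] not in MODES:
            errors.append(f"{label}: unknown mode '{field['mode']}'")
        # RECORD fields need the fourth key and are checked recursively.
        if field.get("type") in ("RECORD", "STRUCT"):
            if "fields" not in field:
                errors.append(f"{label}: RECORD needs a 'fields' key")
            else:
                errors.extend(validate_schema(field["fields"], label + "."))
    return errors


good = [
    {"name": "kind", "type": "STRING", "mode": "NULLABLE"},
    {"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE",
     "fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
                {"name": "number", "type": "INT64", "mode": "NULLABLE"}]},
]
no_inner_fields = [{"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE"}]
inner_missing_mode = [
    {"name": "location", "type": "RECORD", "mode": "NULLABLE",
     "fields": [{"name": "city", "type": "STRING"}]},
]
```

An empty list from validate_schema means the definition follows the rules described here; anything else points at the field (with its dotted path) that breaks them.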
Hopefully this makes it a bit clearer how to define a schema. Let me know if you still have any questions on this subject and I'll update the answer.