how to write Bigquery in new schema with replacing event_dim in old schema from Firebase analytics?

后端 未结 2 1409
无人及你
无人及你 2021-01-21 06:02

The old BigQuery Export schema wise script is running.It is given below. But I want to replicate this code and write it according to new export schema as we Bigquery schema has

相关标签:
2条回答
  • 2021-01-21 06:06

    It is true that there was a schema change in the Google Analytics for Firebase BigQuery Export. Although there is no clear mapping of the old fields as compared to the new ones, the SQL query that is provided in the documentation in order to migrate existing BQ datasets from the old schema to the new one provides some hints of how have these fields changed.

    I share the migration_script.sql SQL query below, just for reference, but let me pin-point the most relevant changes for your use-case:

    • event_dim is mapped as event in the SQL query, but does not have any final representation in the schema, because event_dim is no longer a nested field: UNNEST(event_dim) AS event
    • event_dim.timestamp_micros is mapped as event_timestamp: event.timestamp_micros AS event_timestamp
    • event_dim.name is mapped as event_name: event.name AS event_name
    • event_param.value.int_value is mapped as event_params.value.int_value: event_param.value.int_value AS int_value
    • user_dim.user_properties is mapped as user_properties, and all its nested values follow the same structure: UNNEST(user_dim.user_properties) AS user_property) AS user_properties

    So, in summary, the schema change has been focused at unnesting several of the fields for simplicity, in such a way that, for example, instead of having to access event_dim.name (which would require unnesting and complicating the query), you can query directly the field event_name.

    Having this in mind, I am sure you will be able to adapt your query to this new schema, and it will probably look way more simple, given that you will not have to unnest so many fields.


    Just for clarification, let me share with you a couple of sample BQ queries comparing the old and the new schema (they are using public Firebase tables, so you should be able to run them out-of-the-box):

    # Old Schema - UNNEST() required because there are nested fields
    SELECT
      user_dim.app_info.app_instance_id,
      MIN(event.timestamp_micros) AS min_time,
      MAX(event.timestamp_micros) AS max_time,
      event.name
    FROM
      `firebase-public-project.com_firebase_demo_ANDROID.app_events_20180503`,
      UNNEST(event_dim) AS event
    WHERE
      event.name = "user_engagement"
    GROUP BY
      user_dim.app_info.app_instance_id,
      event.name
    

    As compared to:

    # New Schema - UNNEST() not required because there are no nested fields
    SELECT
      user_pseudo_id,
      MIN(event_timestamp) AS min_time,
      MAX(event_timestamp) AS max_time,
      event_name
    FROM
      `firebase-public-project.analytics_153293282.events_20180815`
    WHERE
      event_name = "user_engagement"
    GROUP BY
      user_pseudo_id,
      event_name
    

    These queries are equivalent, but referencing tables with the old and new schema. Please note that, as your query is more complex, you may need to add some UNNEST() in order to access the remaining nested fields in the table.

    Additionally, you may want to have a look at these samples that can help you with some ideas on how to write queries with the new schema.


    EDIT 2

    My understanding is that a query like the one below should allow you to query for all the fields in a single statement. I am grouping by all the non-aggregated/filtered fields, but depending on your use case (this is definitely something you would need to work on your own) you may want to apply a different strategy in order to be able to query the non-grouped fields (i.e. use a MIN/MAX filter, etc.).

    SELECT
      user_pseudo_id,
      MIN(event_timestamp) AS min_time,
      MAX(event_timestamp) AS max_time,
      event_name,
      par.value.int_value AS engagement_time
    FROM
      `firebase-public-project.analytics_153293282.events_20180815`,
      UNNEST(event_params) as par
    WHERE
      event_name = "user_engagement" AND par.key = "engagement_time_msec"
    GROUP BY
      user_pseudo_id,
      event_name,
      par.value.int_value
    

    ANNEX

    migration_script.sql:

      SELECT
      @date AS event_date,
      event.timestamp_micros AS event_timestamp,
      event.previous_timestamp_micros AS event_previous_timestamp,
      event.name AS event_name,
      event.value_in_usd  AS event_value_in_usd,
       user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
      user_dim.bundle_info.server_timestamp_offset_micros as event_server_timestamp_offset,
      (
      SELECT
        ARRAY_AGG(STRUCT(event_param.key AS key,
            STRUCT(event_param.value.string_value AS string_value,
              event_param.value.int_value AS int_value,
              event_param.value.double_value AS double_value, 
              event_param.value.float_value AS float_value) AS value))
      FROM
        UNNEST(event.params) AS event_param) AS event_params,
      user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
      user_dim.user_id AS user_id,
      user_dim.app_info.app_instance_id AS user_pseudo_id,
      "" AS stream_id,
      user_dim.app_info.app_platform AS platform,
      STRUCT( user_dim.ltv_info.revenue AS revenue,
        user_dim.ltv_info.currency AS currency ) AS user_ltv,
      STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
          user_dim.traffic_source.user_acquired_medium AS medium,
          user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
      STRUCT( user_dim.geo_info.continent AS continent,
        user_dim.geo_info.country AS country,
        user_dim.geo_info.region AS region,
        user_dim.geo_info.city AS city ) AS geo,
      STRUCT( user_dim.device_info.device_category AS category,
        user_dim.device_info.mobile_brand_name,
        user_dim.device_info.mobile_model_name,
        user_dim.device_info.mobile_marketing_name,
        user_dim.device_info.device_model AS mobile_os_hardware_model,
        @platform AS operating_system,
        user_dim.device_info.platform_version AS operating_system_version,
        user_dim.device_info.device_id AS vendor_id,
        user_dim.device_info.resettable_device_id AS advertising_id,
        user_dim.device_info.user_default_language AS language,
        user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
        IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
      STRUCT( user_dim.app_info.app_id AS id,
        @firebase_app_id  AS firebase_app_id,
        user_dim.app_info.app_version AS version,
        user_dim.app_info.app_store AS install_source ) AS app_info,
      (
      SELECT
        ARRAY_AGG(STRUCT(user_property.key AS key,
            STRUCT(user_property.value.value.string_value AS string_value,
              user_property.value.value.int_value AS int_value,
              user_property.value.value.double_value AS double_value,
              user_property.value.value.float_value AS float_value,
              user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
      FROM
        UNNEST(user_dim.user_properties) AS user_property) AS user_properties
    FROM
      `SCRIPT_GENERATED_TABLE_NAME`,
      UNNEST(event_dim) AS event
    
    0 讨论(0)
  • 2021-01-21 06:19

    As I believe my previous answer provides some general ideas for the Community, I will keep it and write a new one in order to be more specific for your use case.

    First of all, I would like to clarify that in order to adapt a query (just like you are asking us to do), one needs to have a clear understanding of the statement, objective of the query, expected results and data to play with. As this is not the case, it is difficult to work with it, even more considering that there are some functionalities that are not clear from the query, for example: in order to obtain the "min_time" and "max_time" for each event, you are taking the min and max value across multiple events, which does not make clear sense to me (it may, depending on your use case, reason why I suggested that it would be better if you could provide more details or work more on the query yourself). Moreover, the new schema "flattens" events, in such a way that each event is written in a different line (you can easily check this by running a SELECT COUNT(*) FROM 'table_with_old_schema' and compare it to SELECT COUNT(*) FROM 'table_with_new_schema'; you will see that the second one has many more rows), so your query does not make sense anymore, because events are not grouped anymore, and then you cannot pick a minimum and maximum between nested fields.

    This being clarified, and having removed some fields that cannot be directly adapted to the new schema (you may be able to adapt this from your side, but this would require some additional effort and understanding of what did those fields mean to you in your previous query), here there are two queries that provide exactly the same results, when run against the same table, with different schema:

    Query against a table with the old schema:

    SELECT
      user_dim.app_info.app_instance_id,
      event.name,
      params.value.int_value engagement_time
    FROM
      `DATASET.app_events_YYYYMMDD`,
      UNNEST(event_dim) AS event,
      UNNEST(event.params) AS params,
      UNNEST(user_dim.user_properties) AS user_params
    WHERE
      (event.name = "user_engagement"
        AND params.key = "engagement_time_msec")
      AND (user_params.key = "plays_quickplay"
        AND user_params.value.value.string_value = "true")
    ORDER BY 1, 2, 3
    

    Query against the same table, with the new schema:

    SELECT
      user_pseudo_id,
      event_name,
      params.value.int_value engagement_time
    FROM
      `DATASET.events_YYYYMMDD`,
      UNNEST(event_params) AS params,
      UNNEST(user_properties) AS user_params
    WHERE
      (event_name = "user_engagement"
        AND params.key = "engagement_time_msec")
      AND (user_params.key = "plays_quickplay"
        AND user_params.value.string_value = "true")
    ORDER BY 1, 2, 3
    

    Again, for this I am using the following table from the public dataset: firebase-public-project.com_firebase_demo_ANDROID.app_events_YYYYMMDD, so I had to change some filters and remove some others in order for it to retrieve sensible results against that table. Therefore, feel free to modify or add the ones you need in order for it to be useful for your use case.

    0 讨论(0)
提交回复
热议问题