Unnest and totals.timeOnSite (BigQuery and Google Analytics data)

温柔的废话 2021-01-26 12:23

I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it's stored as seconds in the raw data), and then I want to break it down

2 Answers
  • 2021-01-26 12:35

    It might seem odd that I'm answering my own question like this, but a contact of mine from outside of Stack Overflow helped me solve this, so it's actually his answer rather than mine.

    The problem with session_duration can be solved with a window function (see the BigQuery documentation on analytic functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions):

    #standardSQL
    SELECT
      iso_date,
      content_group,
      content_level,
      COUNT(DISTINCT SessionId) AS sessions,
      SUM(session_duration) AS session_duration
    FROM (
      SELECT
        date AS iso_date,
        hits.contentGroup.contentGroup1 AS content_group,
        (SELECT MAX(IF(index = 51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
        -- fullVisitorId plus visitId identifies a session
        CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
        -- time until the next hit in the same session; hits.time is in milliseconds, so the result is in hours
        (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
      -- _TABLE_SUFFIX is only available on a wildcard table
      FROM `projectname.123456789.ga_sessions_*`,
        UNNEST(hits) AS hits
      WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
        AND (SELECT
               MAX(IF(index = 51, value, NULL))
             FROM
               UNNEST(hits.customDimensions)
             WHERE
               value IN ("web", "phone", "tablet")
            ) IS NOT NULL
      -- no GROUP BY or ORDER BY here: SessionId and the window function are computed per hit
    )
    GROUP BY iso_date, content_group, content_level
    ORDER BY iso_date, content_group, content_level
    

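    Both the LEAD ... OVER (PARTITION BY ...) expression in the subquery and the nested subquery in the WHERE clause are required for the window function to work properly. The WHERE clause is applied before the window function is evaluated, so LEAD only sees the hits that pass the filter.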
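
    As a rough illustration of what LEAD does here, the sketch below runs the same expression over three made-up hits of a single session. The inline table and the column name hit_time are assumptions for the example, standing in for the unnested hits and hits.time.

    ```sql
    -- Hypothetical inline data: one session with three hits.
    -- hit_time is milliseconds since the start of the session.
    WITH sample_hits AS (
      SELECT 'A' AS fullVisitorId, 1 AS visitId, 0 AS hit_time, 'home' AS content_group UNION ALL
      SELECT 'A', 1, 60000, 'news' UNION ALL
      SELECT 'A', 1, 180000, 'home'
    )
    SELECT
      content_group,
      hit_time,
      -- Time until the next hit of the same session, converted from milliseconds to hours.
      (LEAD(hit_time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hit_time ASC) - hit_time) / 3600000 AS duration_hours
    FROM sample_hits
    ORDER BY hit_time
    ```

    The first two rows come out as 1/60 and 2/60 of an hour. The last hit of the session gets NULL because no hit follows it, and SUM skips that NULL in the outer query.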

    The query also counts sessions more accurately, as distinct combinations of fullVisitorId and visitId.
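
    A small sketch of why the combination matters, using made-up ids: in the Google Analytics export, visitId is only unique per visitor, so two visitors can share the same value.

    ```sql
    -- Hypothetical inline data: three sessions, two of which share a visitId.
    WITH sample_sessions AS (
      SELECT 'A' AS fullVisitorId, 1483228800 AS visitId UNION ALL
      SELECT 'B', 1483228800 UNION ALL  -- different visitor, same visitId
      SELECT 'A', 1483232400
    )
    SELECT
      COUNT(DISTINCT visitId) AS by_visit_id,  -- 2: undercounts
      COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS by_session_id  -- 3
    FROM sample_sessions
    ```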

  • 2021-01-26 12:38

    I couldn't fully test this one but it seems to be working against my dataset:

    SELECT
      date,
      COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) AS sessions,
      AVG(tos) AS avg_time_on_site,
      content_group,
      content_level
    FROM (
      SELECT
        date,
        fullVisitorId AS fv,
        visitId AS v,
        ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,
        ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,
        totals.timeOnSite / 3600 AS tos
      FROM `dataset_id.ga_sessions_20170101`
      WHERE totals.timeOnSite IS NOT NULL
    )
    CROSS JOIN UNNEST(content_group) AS content_group
    LEFT JOIN UNNEST(content_level) AS content_level
    GROUP BY
      date, content_group, content_level
    

    What I tried to do first is avoid the UNNEST(hits) operation on the entire dataset. Therefore, in the inner SELECT, content_group and content_level are stored as ARRAYs, one pair per session.
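
    A minimal sketch of that first step on made-up data, assuming a single session whose three hits carry a content group and a custom dimension with index 51. The inline table sample_ga and the aliases h and cd are illustrative only.

    ```sql
    -- Hypothetical inline data shaped like one row of a ga_sessions table.
    WITH sample_ga AS (
      SELECT
        'A' AS fullVisitorId,
        1 AS visitId,
        STRUCT(120 AS timeOnSite) AS totals,
        [
          STRUCT(STRUCT('home' AS contentGroup1) AS contentGroup, [STRUCT(51 AS index, 'web' AS value)] AS customDimensions),
          STRUCT(STRUCT('news' AS contentGroup1) AS contentGroup, [STRUCT(51 AS index, 'web' AS value)] AS customDimensions),
          STRUCT(STRUCT('home' AS contentGroup1) AS contentGroup, [STRUCT(51 AS index, 'web' AS value)] AS customDimensions)
        ] AS hits
    )
    SELECT
      fullVisitorId,
      visitId,
      totals.timeOnSite,
      -- The arrays are built per row, so the session still occupies a single row.
      ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,
      ARRAY(SELECT DISTINCT cd.value FROM UNNEST(hits) AS h, UNNEST(h.customDimensions) AS cd WHERE cd.index = 51) AS content_level
    FROM sample_ga
    ```

    The result is one row with timeOnSite 120, a content_group array holding 'home' and 'news' (in no guaranteed order), and a content_level array holding 'web'.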

    In the outer SELECT, I unnested both of those ARRAYs and computed the total sessions and the average time on site, grouping by the desired fields. I used the average here because it seems to make more sense for time on site, but if you need the sum, just change AVG to SUM.

    You won't have the problem of repeated timeOnSite in this query because the outer UNNEST(hits) was avoided. When UNNEST(content_group) and UNNEST(content_level) are applied, each value inside those ARRAYs is associated only once with the session's time on site, so no duplication happens.
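
    To make the difference concrete, here is a sketch that compares the two approaches on one made-up session of 120 seconds with three hits, two of them in the same content group. The inline table sample_ga and the method labels are assumptions for the example.

    ```sql
    -- Hypothetical inline data: one session, 120 seconds on site, hits in 'home', 'news', 'home'.
    WITH sample_ga AS (
      SELECT
        STRUCT(120 AS timeOnSite) AS totals,
        [
          STRUCT(STRUCT('home' AS contentGroup1) AS contentGroup),
          STRUCT(STRUCT('news' AS contentGroup1) AS contentGroup),
          STRUCT(STRUCT('home' AS contentGroup1) AS contentGroup)
        ] AS hits
    )
    -- Unnesting every hit repeats timeOnSite once per hit.
    SELECT 'unnest_hits' AS method, h.contentGroup.contentGroup1 AS content_group, SUM(totals.timeOnSite) AS tos_seconds
    FROM sample_ga, UNNEST(hits) AS h
    GROUP BY content_group
    UNION ALL
    -- Unnesting the array of distinct groups repeats it once per group only.
    SELECT 'distinct_array' AS method, content_group, SUM(totals.timeOnSite) AS tos_seconds
    FROM sample_ga, UNNEST(ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits))) AS content_group
    GROUP BY content_group
    ORDER BY method, content_group
    ```

    With unnest_hits, 'home' is credited 240 seconds because the session has two 'home' hits. With distinct_array, 'home' and 'news' each get 120 seconds.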
