I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it\'s stored as seconds in the raw data), and then I want to break it down
It might seem odd that I'm answering my own question like this, but a contact of mine from outside of Stack Overflow helped me solve this, so it's actually his answer rather than mine.
The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)
#StandardSQL
SELECT
iso_date,
content_group,
content_level,
COUNT(DISTINCT SessionId) AS sessions,
SUM(session_duration) AS session_duration
FROM (
SELECT
date AS iso_date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
(LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
FROM `projectname.123456789.ga_sessions_20170101`,
unnest(hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
AND (SELECT
MAX(IF(index=51, value, NULL))
FROM
UNNEST(hits.customDimensions)
WHERE
value IN ("web", "phone", "tablet")
) IS NOT NULL
GROUP BY
iso_date, content_group, content_level
ORDER BY
iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.
A more accurate way of calculating sessions is also provided.
I couldn't fully test this one but it seems to be working against my dataset:
SELECT
DATE,
COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) sessions,
AVG(tos) avg_time_on_site,
content_group,
content_level
FROM(
SELECT
date AS date,
fullvisitorid fv,
visitid v,
ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,
ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,
totals.timeOnSite / 3600 AS tos
FROM `dataset_id.ga_sessions_20170101`
WHERE totals.timeOnSite IS NOT NULL
)
CROSS JOIN UNNEST(content_group) content_group
LEFT JOIN UNNEST(content_level) content_level
GROUP BY
DATE, content_group, content_level
What I tried to do is first to avoid the UNNEST(hits)
operation on the entire dataset. Therefore, in the very first SELECT
statement, content_group
and content_level
are stored as ARRAYs.
In the next SELECT
, I unnested both of those ARRAYs and counted for the total sessions and the average time on site while grouping for the desired fields (I used the average here as it seems to make more sense when dealing with time on site but if you need the summation you can just change the AVG
to SUM
).
You won't have the problem of repeated timeOnSite
in this query because the outer UNNEST(hits)
was avoided. When the UNNEST(content_group)
and UNNEST(content_level)
happens, each value inside those ARRAYs gets associated only once to its correspondent time_on_site
so no duplication is happening.