I have an app that sends data to Google Analytics. I am interested in accessing and storing this data on a Hadoop cluster. I am guessing this raw data will be in the form of
There is no way to get the raw logs, but the Google Analytics API will let you extract your data out of the system.
There are limits to what you can do: the API only returns aggregated (and possibly sampled) report data, subject to query quotas.
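As a sketch of what an API extraction looks like: the (legacy) Core Reporting API v3 takes a plain HTTPS GET with `ids`, `metrics`, `dimensions`, and date parameters. The view ID and access token below are hypothetical placeholders, not real credentials:

```javascript
// Sketch: build a Core Reporting API (v3) request URL.
// viewId and accessToken are hypothetical placeholders.
function buildReportingUrl(viewId, accessToken) {
  var params = new URLSearchParams({
    ids: 'ga:' + viewId,              // the GA view (profile) to query
    'start-date': '2015-01-01',
    'end-date': '2015-01-31',
    metrics: 'ga:sessions,ga:pageviews',
    dimensions: 'ga:date,ga:pagePath',
    access_token: accessToken
  });
  return 'https://www.googleapis.com/analytics/v3/data/ga?' + params.toString();
}

console.log(buildReportingUrl('12345678', 'ya29.EXAMPLE'));
```

Fetching that URL (with a valid OAuth token) returns a JSON report you can land in HDFS like any other file.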
It may be worth noting that a Google Analytics Premium customer can export the raw data from GA to BigQuery. Exporting data from BigQuery is free of charge, but storage and query processing are priced based on usage.
Google's own pitch: "premium analytics at a reasonable price for one flat annual fee of $150,000".
Since we're supposed to answer the original question: there is no way to get the actual raw Google Analytics logs other than by duplicating the server-call system.
In other words, you need to use a modified copy of the analytics.js script that points to a web server you host, which collects the server calls.
Long story short, you want your site to send hits to http://www.yourdatacollectionserver.com/collect?v=1&t=pageview[...] instead of http://www.google-analytics.com/collect?v=1&t=pageview[...]
This is easily deployed using a tag manager such as Google Tag Manager (GTM), alongside your normal Google Analytics tags.
That will effectively create log entries on your web server, which you can process with an ETL tool, Snowplow, Splunk, or your favorite Python/Perl/Ruby text-parsing engine.
It is then up to you to process the actual raw logs into something manageable. And before you ask: this is not retroactive.
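The parsing step is simpler than it sounds, because each captured hit is just a Measurement Protocol query string. A minimal Node.js sketch (the parameter names v, t, cid, dp are Measurement Protocol fields; the sample log line is made up):

```javascript
// Sketch: turn one logged /collect hit into a flat record.
// Parameter names (v, t, cid, dp, ...) are Measurement Protocol fields.
function parseHit(logLine) {
  var query = logLine.split('?')[1] || '';
  var record = {};
  // URLSearchParams handles splitting on '&' and percent-decoding for us.
  new URLSearchParams(query).forEach(function (value, key) {
    record[key] = value;
  });
  return record;
}

// Made-up example line, as it might appear in an access log:
var hit = parseHit('/collect?v=1&t=pageview&cid=555.123&dp=%2Fhome');
console.log(hit); // { v: '1', t: 'pageview', cid: '555.123', dp: '/home' }
```

Run something like this over each request line in your access logs and you have hit-level records ready to load into Hadoop.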
To get GA data click by click, you can structure your queries so that the results can be joined together.
First you need to prepare the data in GA: with each hit you send, write a hashed value (or the clientId plus a timestamp) into a custom dimension. This key lets you join the individual query results later.
E.g. (this is how we do it at Scitylana): the script below hooks into GA's tracking script and makes sure each hit contains a key for later stitching of query results.
<script>
// Index of the custom dimension that will hold the stitching key.
var BindingsDimensionIndex = CUSTOM_DIMENSION_INDEX_HERE;
var Version = 1;

function overrideBuildTask() {
  var gaObj = window[window['GoogleAnalyticsObject'] || 'ga'];
  var trackers = gaObj.getAll();
  if (window.console) { console.log('Found ' + trackers.length + ' ga trackers'); }
  for (var i = 0; i < trackers.length; i++) {
    // IIFE so each tracker's callback closes over its own variables;
    // with plain `var` in the loop, every callback would otherwise
    // reference the last tracker.
    (function (tracker) {
      if (tracker.buildHitTaskIsModified) { return; }
      var name = tracker.get('name');
      var originalBuildHitTask = tracker.get('buildHitTask');
      if (window.console) { console.log(name + ' modified'); }
      tracker.set('buildHitTask', function (model) {
        // Monotonic counter so two hits in the same millisecond still get unique keys.
        window['_sc_order'] = typeof window['_sc_order'] == 'undefined' ? 0 : window['_sc_order'] + 1;
        var key = [
          'sl=' + Version,
          'u=' + tracker.get('clientId'),
          't=' + (new Date().getTime() + window['_sc_order'])
        ].join('&');
        model.set('dimension' + BindingsDimensionIndex, key);
        originalBuildHitTask(model);
        if (window.console) {
          console.log(name + '.' + model.get('hitType') + '.set.customDimension' + BindingsDimensionIndex + ' = ' + key);
        }
      });
      tracker.buildHitTaskIsModified = true;
    })(trackers[i]);
  }
}

window.ga = window.ga || function () {
  (ga.q = ga.q || []).push(arguments);
  if (arguments[0] === 'create') { ga(overrideBuildTask); }
};
ga.l = +new Date();
</script>
Of course, you now need a script that joins all the results you have extracted from GA.
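A minimal sketch of that stitching step, assuming you have exported several query results as arrays of rows that each carry the custom-dimension key (the field names `key`, `pagePath`, `eventAction` below are made up for illustration):

```javascript
// Sketch: merge rows from several GA query exports on the stitching key.
// Each row is assumed to carry the custom-dimension value under `key`;
// all other field names are hypothetical.
function stitch(resultSets) {
  var byKey = {};
  resultSets.forEach(function (rows) {
    rows.forEach(function (row) {
      var merged = byKey[row.key] || {};
      Object.keys(row).forEach(function (field) {
        merged[field] = row[field];
      });
      byKey[row.key] = merged;
    });
  });
  return byKey; // one merged record per hit key
}

// One query returned page data, another returned event data, for the same hit:
var pages = [{ key: 'sl=1&u=555.123&t=1447000000000', pagePath: '/home' }];
var events = [{ key: 'sl=1&u=555.123&t=1447000000000', eventAction: 'click' }];
console.log(stitch([pages, events]));
```

Because every query includes the same unique per-hit dimension, each result set contributes its columns to the same record, rebuilding a hit-level view from aggregated reports.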
You can get aggregated data, i.e. the data you can see in your Google Analytics account, using the Google Analytics API. To get raw data, you need to be a Premium user (costs ~$150k per year). Premium users can export into Google BigQuery and from there to wherever you want.