Question
I am working on a project where I have to store events related to user activity, per user, on a daily basis for later analysis. I will be getting a stream of timestamped events and will later run Dataflow jobs on this data to compute per-user stats. I am exploring Bigtable to store this data, with the timestamp acting as the row key, so that I can later run a range query to fetch a single day's data and process it. But after going through a couple of resources I learned that with timestamped row keys, Bigtable can run into a hotspotting problem. I can't promote the userid into the row key to avoid this. Is there an alternative approach, or another storage engine, that would help in this use case?
Use case: I have user activity data, such as impressions and clicks, arriving in streams. Based on rules, I have to aggregate data from these streams over a certain duration, store it, and serve it to an upstream service as soon as possible. Data will be processed in tumbling windows, currently 24 hours, though the window may grow or shrink. The choices I have to make are: how to store the raw events (Bigtable, BigQuery, or direct analysis on the streams), the compute engine (Beam vs. aggregation queries), and the final storage (keyed by user id). The relation between a user and the aggregated data is one-to-many.
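For illustration, here is a minimal Apache Beam (Python) sketch of the tumbling-window aggregation described above; the Pub/Sub topic names and the event JSON schema are assumptions, not part of the question:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def parse_event(message):
    # Assumed event schema: {"user_id": "...", "type": "impression" | "click", ...}
    event = json.loads(message.decode("utf-8"))
    return (event["user_id"], 1)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         # Topic names are placeholders.
         | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/user-events")
         | "Parse" >> beam.Map(parse_event)
         # 24-hour tumbling window; adjust the size if the window grows or shrinks.
         | "Window" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
         | "CountPerUser" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
         | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/daily-aggregates"))


if __name__ == "__main__":
    run()
```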
Answer 1:
Given that you can't access the userid at query time, you'll have to make a tradeoff somewhere. It seems like you'll be doing more writes than reads here because you're writing data during the day for each user and then only doing a read maybe once a day to analyze the data? Correct me if my interpretation is wrong.
I would say it's fine for the scan in your Dataflow job to be less efficient if it avoids hotspots on your writes.
You can promote the userid into your row key, using something like userid#date, and then do a scan with a row-key regex filter that looks for *#YOUR_DATE.
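A minimal sketch of that pattern with the google-cloud-bigtable Python client; the project, instance, table, and column family names here are assumptions:

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Placeholder names for project, instance, and table.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("user-events")


def write_event(user_id, event_type, when=None):
    """Write side: the userid prefix spreads writes across tablets,
    avoiding the hotspot a pure timestamp key would create."""
    when = when or datetime.datetime.utcnow()
    row_key = f"{user_id}#{when:%Y-%m-%d}".encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("events", event_type.encode("utf-8"), b"1", timestamp=when)
    row.commit()


def read_day(day):
    """Read side: a full-table scan with a row-key regex filter that
    matches any userid followed by the target date ("YYYY-MM-DD")."""
    key_filter = row_filters.RowKeyRegexFilter(f".*#{day}".encode("utf-8"))
    for row in table.read_rows(filter_=key_filter):
        yield row.row_key.decode("utf-8"), row.cells
```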
This isn't the most efficient read, since it is a full table scan AND uses a fairly intensive filter, but if you're optimizing your database for writes, it would still let you read the data.
Feel free to provide more information about your pipeline and expected database use case, if my assumptions don't align with your goals.
Answer 2:
If you need to store large amounts of data and have it available by timestamp or date for later analysis, you should use BigQuery instead of Bigtable.
BigQuery offers the option to partition the tables by timestamps/dates.
Each partition in a date/timestamp partitioned table can be thought of as a range where the start of the range is the beginning of a day, and the interval of the range is one day. Date/timestamp partitioned tables do not need a _PARTITIONTIME pseudo column. Queries against date/timestamp partitioned tables can specify predicate filters based on the partitioning column to reduce the amount of data scanned.
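As a sketch, here is how you might create a date-partitioned table and query a single partition with the google-cloud-bigquery Python client; the project, dataset, and schema are assumptions:

```python
import datetime

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Create a table partitioned by the event_date column (assumed schema).
table = bigquery.Table(
    "my-project.analytics.user_events",
    schema=[
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_type", "STRING"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
client.create_table(table, exists_ok=True)

# A predicate filter on the partitioning column prunes the scan to one day.
query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.analytics.user_events`
    WHERE event_date = @day
    GROUP BY user_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("day", "DATE", datetime.date(2020, 4, 1))
    ]
)
for row in client.query(query, job_config=job_config):
    print(row.user_id, row.events)
```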
Source: https://stackoverflow.com/questions/60948647/timeseries-data-schema-design-for-google-bigtable-or-any-google-offering