How to partition Azure tables used for storing logs

日久生厌 2021-02-06 06:35

We have recently updated our logging to use Azure table storage, which owing to its low cost and high performance when querying by row and partition is highly suited to this pur

4 Answers
  • 2021-02-06 07:11

    Not really a concrete answer to your question, but here are some of my thoughts:

    What you really need to think about is how you are going to query your data, and then design your data storage/partitioning strategy based on that (keeping in mind the Partitioning Strategy guide). For example,

    • If you need to view logs for all loggers within a given date/time range, then your current approach might not be appropriate because you would need to query across multiple partitions in parallel.
    • Your current approach would work if you want to query for a specific logger within a given date/time range.
    • Another thing that was suggested to me is to make appropriate use of blob storage & table storage. If there's some data which does not require querying that often, you can simply push that data in blob storage (think about old logs - you don't really need to keep them in tables if you're not going to query them too often). Whenever you need such data, you can simply extract it from blob storage, push it in table storage and run your ad-hoc queries against that data.

    Possible Solution

    One possible solution would be to store multiple copies of the same data and query whichever copy suits the request. Since storage is cheap, you can save two copies of the same data. In the 1st copy you could have PK = Date/Time and RK = whatever you decide, and in the 2nd copy you could have PK = Logger and RK = TicksReversed+GUID. Then, when you want to fetch all logs irrespective of the logger, you simply query the 1st copy (PK = Date/Time), and when you want logs for a specific logger, you query the 2nd copy (PK = Logger, RK >= Date/Time Start & RK <= Date/Time End).
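
    A minimal sketch of how the keys for the two copies might be built. The hourly bucket for the 1st copy, the `logger_` prefix in its RowKey, and the function names are all assumptions for illustration; only PK = Logger and RK = TicksReversed+GUID come from the text above. Ticks follow the .NET convention (100 ns units since 0001-01-01).

```python
import uuid
from datetime import datetime, timedelta

# .NET DateTime.MaxValue.Ticks; one tick is 100 nanoseconds since 0001-01-01.
MAX_TICKS = 3155378975999999999


def to_ticks(dt: datetime) -> int:
    """Convert a naive UTC datetime to .NET-style ticks."""
    return ((dt - datetime(1, 1, 1)) // timedelta(microseconds=1)) * 10


def reversed_ticks(dt: datetime) -> str:
    """Zero-padded to 19 digits so string order matches numeric order."""
    return f"{MAX_TICKS - to_ticks(dt):019d}"


def keys_for_both_copies(logger: str, when: datetime) -> dict:
    """Return (PartitionKey, RowKey) for each of the two copies."""
    suffix = uuid.uuid4().hex  # keeps RowKeys unique within the same tick
    rk = f"{reversed_ticks(when)}_{suffix}"
    return {
        # Copy 1: partitioned by time (the hourly bucket is an assumption).
        "by_time": (when.strftime("%Y%m%d%H"), f"{logger}_{rk}"),
        # Copy 2: partitioned by logger, newest rows sort first.
        "by_logger": (logger, rk),
    }
```

    Because the ticks are reversed, a later timestamp yields a smaller RowKey. A range query on the 2nd copy would therefore use `reversed_ticks(end)` as the lower bound and `reversed_ticks(start)` as the upper bound.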

    You may also find this link helpful: http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

  • 2021-02-06 07:18

    If I'm reading the question correctly, here are the solution constraints:

    • Use Table storage
    • High scale write
    • Separated by product area
    • Automatically ordered by time

    There are several good solutions already presented, but I don't think there's an answer that satisfies all the constraints perfectly.

    The solution that seems closest to satisfying your constraints was provided by usr. Divide your product area partitions into N, but don't use GUIDs, just use a number (ProductArea-5). Using GUIDs makes the querying problem much more difficult. If you use a number, you can query all of the partitions for a product area in a single query or even in parallel. Then continue to use TicksReversed+GUID for RowKey.

    Single Query: PartitionKey ge 'ProductArea' and PartitionKey le 'ProductArea-~' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks'

    Parallel Queries: PartitionKey eq 'ProductArea-1' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks' ... PartitionKey eq 'ProductArea-N' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks'

    This solution doesn't satisfy 'automatically ordered by time', but you can do a client-side sort by RowKey to see them in order. If having to sort client-side is okay for you, then this solution should work to satisfy the rest of the constraints.
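
    As a rough sketch, the two query shapes and the client-side sort could be expressed as below. The function names are hypothetical and the code only builds filter strings; it does not call any storage SDK. Each parallel filter uses `eq`, on the assumption that every parallel query should target exactly one partition.

```python
def single_query_filter(area: str, rk_from: str, rk_to: str) -> str:
    """One filter spanning every 'Area-<n>' partition of a product area."""
    return (
        f"PartitionKey ge '{area}' and PartitionKey le '{area}-~' "
        f"and RowKey ge '{rk_from}' and RowKey le '{rk_to}'"
    )


def parallel_query_filters(area: str, n: int, rk_from: str, rk_to: str) -> list:
    """One filter per partition 'Area-1' .. 'Area-N', to be issued concurrently."""
    return [
        f"PartitionKey eq '{area}-{i}' "
        f"and RowKey ge '{rk_from}' and RowKey le '{rk_to}'"
        for i in range(1, n + 1)
    ]


def sort_by_row_key(rows: list) -> list:
    """Client-side ordering: reversed-tick RowKeys sort newest first."""
    return sorted(rows, key=lambda r: r["RowKey"])
```

    The results of all parallel queries would be concatenated and passed through `sort_by_row_key` to restore time order across partitions.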

  • 2021-02-06 07:22

    There is a very general trick to avoid hot spots when writing while at the same time increasing read costs a bit.

    Define N partitions (like 10 or so). When writing a row, put it into a randomly chosen partition. Each partition can be sorted by time internally.

    When reading you need to read from all N partitions (possibly filtered and ordered by time) and merge the query results.

    This increases write scalability by a factor of N and multiplies read cost by the same factor, since every read now takes N queries and round-trips.
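
    The write and read paths described above can be sketched with an in-memory stand-in for the table. The `logs-<i>` partition names, N = 10, and the dictionary used as storage are assumptions for illustration only.

```python
import heapq
import random

N = 10  # number of write partitions; the value is an assumption

# Stand-in for table storage: partition key -> list of (row_key, message).
table = {f"logs-{i}": [] for i in range(N)}


def write(row_key: str, message: str) -> None:
    """Spread writes by picking one of the N partitions at random."""
    partition = f"logs-{random.randrange(N)}"
    table[partition].append((row_key, message))


def read_all() -> list:
    """Read every partition, then merge the sorted results into one stream."""
    per_partition = [sorted(rows) for rows in table.values()]
    return list(heapq.merge(*per_partition))
```

    `heapq.merge` combines the already sorted per-partition results without re-sorting everything, which mirrors the "read N partitions and merge" step.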

    Also, you could consider storing logs somewhere else. The very tight artificial limits on Azure products cause labor costs that you otherwise would not have.

    Choose N to be higher than needed to reach the 20,000 operations per second per account limit so that randomly occurring hotspots are unlikely. Picking N to be twice as high as minimally needed should be enough.
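
    The sizing rule can be written out as simple arithmetic. The 20,000 figure is the account limit quoted above; the 2,000 entities per second per partition figure is an assumption based on the per-partition target Azure published at the time, so check the current scalability targets before relying on it.

```python
import math

ACCOUNT_LIMIT = 20_000        # ops/sec per account, as quoted above
PER_PARTITION_TARGET = 2_000  # entities/sec per partition (assumed figure)


def partitions_needed(target_writes_per_sec: int, headroom: float = 2.0) -> int:
    """Minimum partition count for the target rate, times a safety factor."""
    capped = min(target_writes_per_sec, ACCOUNT_LIMIT)
    minimum = math.ceil(capped / PER_PARTITION_TARGET)
    return math.ceil(minimum * headroom)
```

    Under these assumptions, saturating the account limit would need at least 10 partitions, and doubling that gives N = 20.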

  • 2021-02-06 07:25

    I have come across a situation similar to yours. Based on my experience, I can say:

    Whenever a query is fired on an Azure storage table, it does a full table scan if a proper partition key is not provided. In other words, a storage table is indexed on the partition key, and partitioning the data properly is the key to getting fast results.

    That said, you now have to think about what kinds of queries you will fire on the table, such as logs that occurred during a time period, logs for a product, etc.

    One way is to use reverse ticks truncated to hour precision, instead of the exact ticks, as part of the partition key. That way an hour's worth of data can be queried with a single partition key. Depending on the number of rows that fall into each partition, you could change the precision to a day. It is also wise to store related data together, meaning the data for each product would go to a different table. That way you can reduce the number of partitions and the number of rows in each partition.
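
    A small sketch of hour-precision reverse ticks, assuming .NET-style ticks (100 ns units since 0001-01-01) and naive UTC datetimes. The function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_TICKS = 3155378975999999999  # .NET DateTime.MaxValue.Ticks


def hour_partition_key(when: datetime) -> str:
    """Reverse ticks of the hour that contains `when`, zero-padded."""
    hour = when.replace(minute=0, second=0, microsecond=0)
    ticks = ((hour - datetime(1, 1, 1)) // timedelta(microseconds=1)) * 10
    return f"{MAX_TICKS - ticks:019d}"


def partition_keys_between(start: datetime, end: datetime) -> list:
    """All hourly partition keys covering [start, end], newest first."""
    first_hour = start.replace(minute=0, second=0, microsecond=0)
    hour = end.replace(minute=0, second=0, microsecond=0)
    keys = []
    while hour >= first_hour:
        keys.append(hour_partition_key(hour))
        hour -= timedelta(hours=1)
    return keys
```

    Knowing the exact list of partition keys for a time range lets each query name its partition explicitly instead of scanning the table.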

    Basically, ensure that you know the partition keys in advance (exact or range) and fire queries against such specific partition keys to get results faster.

    To speed up writing to the table, you can use batch operations. Be cautious, though: if one entity in the batch fails, the whole batch operation fails. Proper retry and error checking can save you here.
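
    One way to prepare entities for batching is sketched below. It assumes the Table storage rules that a batch must share a single PartitionKey and may hold at most 100 entities; the function name and the dictionary shape of an entity are illustrative, and the actual submit and retry calls are left out.

```python
from collections import defaultdict

MAX_BATCH_SIZE = 100  # assumed per-batch entity limit for Table storage


def make_batches(entities: list) -> list:
    """Group entities by PartitionKey, then split each group into chunks."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)

    batches = []
    for rows in by_partition.values():
        for i in range(0, len(rows), MAX_BATCH_SIZE):
            batches.append(rows[i:i + MAX_BATCH_SIZE])
    return batches
```

    Since a failed entity fails its whole batch, each returned chunk is the natural unit to retry.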

    At the same time, you could use blob storage to store a lot of related data. The idea is to store a chunk of related, serialized data as one blob. You can fetch one such blob to get all the data in it and do further projections on the client side. For example, an hour's worth of data for a product would go into one blob; you can devise a specific blob prefix naming pattern and fetch the exact blob when needed. This gets you your data much faster than doing a table scan for each query.

    I used the blob approach and have been using it for a couple of years with no trouble. I convert my collection to IList<IDictionary<string,string>> and use binary serialization and Gzip for storing each blob. I use Reflection.Emit based helper methods to access entity properties quickly, so serialization and deserialization don't take a toll on CPU and memory.
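
    The pack-and-compress idea can be illustrated roughly as follows. This sketch swaps in JSON for the binary serialization mentioned above, and the `logs/<product>/<yyyy>/<MM>/<dd>/<HH>` naming pattern is a hypothetical example of a per-product, per-hour prefix, not the scheme actually used.

```python
import gzip
import json
from datetime import datetime


def blob_name(product: str, when: datetime) -> str:
    """Hypothetical naming pattern: one blob per product per hour."""
    return f"logs/{product}/{when:%Y/%m/%d/%H}.json.gz"


def pack(rows: list) -> bytes:
    """Serialize a list of string-to-string dicts and gzip the result."""
    return gzip.compress(json.dumps(rows).encode("utf-8"))


def unpack(payload: bytes) -> list:
    """Inverse of pack(): gunzip and parse back into a list of dicts."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

    With a predictable name, a reader can compute the blob path for a product and hour directly, download it, and call `unpack` to filter the rows on the client.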

    Storing data in blobs helps me store more for less and get my data faster.
