Retrieving many rows using a TableBatchOperation is not supported?

被撕碎了的回忆 2021-01-11 11:55

Here is a piece of code that initializes a TableBatchOperation designed to retrieve two rows in a single batch:

    TableBatchOperation batch = new TableBatchOperation();
    batch.Add(TableOperation.Retrieve("partition1", "row1"));
    batch.Add(TableOperation.Retrieve("partition1", "row2")); // adding the second Retrieve throws: a batch may contain at most one retrieve operation and nothing else
7 Answers
  • 2021-01-11 12:03

    Try something like this:

    TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
                                                    .Where(TableQuery.CombineFilters(
                                                        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
                                                        TableOperators.And,
                                                        TableQuery.CombineFilters(
                                                            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row1"),
                                                            TableOperators.Or,
                                                            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row2"))));
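    A minimal usage sketch for executing that query, assuming the classic storage SDK and a CloudTable reference named table (not shown in the answer):

    // table is a hypothetical CloudTable, e.g. from CloudTableClient.GetTableReference("mytable").
    foreach (DynamicTableEntity entity in table.ExecuteQuery(query))
    {
        Console.WriteLine(entity.RowKey);
    }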
    
  • 2021-01-11 12:04

    I know that this is an old question, but as Azure STILL does not support secondary indexes, it seems it will be relevant for some time.

    I was hitting the same type of problem. In my scenario, I needed to look up hundreds of items within the same partition, where there are millions of rows (imagine a GUID as the row key). I tested a couple of options for looking up 10,000 rows:

    1. (PK && RK)
    2. (PK && RK1) || (PK && RK2) || ...
    3. PK && (RK1 || RK2 || ... )

    I was using the async API, with a maximum of 10 degrees of parallelism (at most 10 outstanding requests). I also tested a couple of different batch sizes (10 rows, 50, 100).

    Test                        Batch Size  API calls   Elapsed (sec)
    (PK && RK)                  1           10000       95.76
    (PK && RK1) || (PK && RK2)  10          1000        25.94
    (PK && RK1) || (PK && RK2)  50          200         18.35
    (PK && RK1) || (PK && RK2)  100         100         17.38
    PK && (RK1 || RK2 || … )    10          1000        24.55
    PK && (RK1 || RK2 || … )    50          200         14.90
    PK && (RK1 || RK2 || … )    100         100         13.43
    

    NB: These are all within the same partition - just multiple row keys.

    I would have been happy to just reduce the number of API calls. But as an added benefit, the elapsed time is also significantly less, saving on compute costs (at least on my end!).

    Not too surprisingly, the batches of 100 rows delivered the best elapsed performance. There are obviously other performance considerations, especially network usage (#1 hardly uses the network at all, for example, whereas the others push it much harder).

    EDIT: Be careful when querying for many row keys. There is (of course) a URL length limitation on the query. If you exceed the length, the query will still succeed, because the service cannot tell that the URL was truncated; you silently get results for the truncated query rather than an error. In our case, we limited the combined query length to about 2500 characters (URL encoded!).
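    A sketch of how the winning PK && (RK1 || RK2 || ...) filter can be composed with the classic Microsoft.WindowsAzure.Storage.Table API; the helper name BuildRowKeyBatchFilter is illustrative, not part of the SDK:

    // Builds: PartitionKey eq 'pk' and (RowKey eq 'rk1' or RowKey eq 'rk2' or ...)
    // Assumes rowKeys is non-empty.
    static string BuildRowKeyBatchFilter(string partitionKey, IEnumerable<string> rowKeys)
    {
        string rowKeyFilter = null;
        foreach (string rowKey in rowKeys)
        {
            string condition = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, rowKey);
            rowKeyFilter = rowKeyFilter == null
                ? condition
                : TableQuery.CombineFilters(rowKeyFilter, TableOperators.Or, condition);
        }

        return TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            rowKeyFilter);
    }

    Feed it batches of around 100 row keys, and check the URL-encoded length of the resulting filter against the ~2500-character limit mentioned above before issuing each request.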

  • 2021-01-11 12:08

    Batch "Get" operations are not supported by Azure Table Storage. Supported operations are: Add, Delete, Update, and Merge. You would need to execute queries as separate requests. For faster processing, you may want to execute these queries in parallel.

  • 2021-01-11 12:19

    How many entities do you have per partition? With one query you can pull back up to 1,000 entities per request. Then you could do your row key filtering on the in-memory set and only pay for one operation.

    Another option is to do a row key range query to retrieve part of a partition in one operation. Essentially, you specify an upper and lower bound for the row keys to return, rather than pulling back an entire partition.
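    A sketch of such a range query, again with the classic API; the partition name and the "row100"/"row199" bounds are placeholders:

    // Returns every entity in partition1 whose RowKey sorts between the two bounds (inclusive).
    TableQuery<DynamicTableEntity> rangeQuery = new TableQuery<DynamicTableEntity>()
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
            TableOperators.And,
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "row100"),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, "row199"))));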

  • 2021-01-11 12:23

    Okay, so the best-case replacement for a batch retrieve operation is a single table query. A less optimal situation requires parallel retrieve operations.

    Depending on your PK/RK design, you can take a list of (PK, RK) pairs and figure out the smallest/most efficient set of retrieve/query operations you need to perform. You then fetch all of these in parallel and sort out the exact answer client-side.

    IMAO, it was a design miss by Microsoft to add the Retrieve method to the TableBatchOperation class because it conveys semantics not supported by the table storage API.

    Right now, I'm not in the mood to write something super efficient, so I'm just gonna leave this super simple solution here.

    var retrieveTasks = new List<Task<TableResult>>();
    
    // Kick off every retrieve without awaiting, so the requests run in parallel.
    foreach (var item in list)
    {
        retrieveTasks.Add(table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk)));
    }
    
    var retrieveResults = new List<TableResult>();
    
    // Await the tasks in the order they were started, preserving the order of list.
    foreach (var retrieveTask in retrieveTasks)
    {
        retrieveResults.Add(await retrieveTask);
    }
    

    This asynchronous block of code will fetch the entities in list in parallel and store the results in retrieveResults, preserving the order. If you have contiguous ranges of entities to fetch, you can improve on this by using a range query.

    There is a sweet spot (which you'll have to find by testing) where it's probably faster/cheaper to query more entities than a specific batch retrieve actually needs and then discard the results you don't want.

    If you have a small partition you might benefit from a query like so:

    where pk=partition1 and (rk=rk1 or rk=rk2 or rk=rk3)
    

    If the lexicographic (i.e. sort-order) distance between your keys is large, you might want to fetch them in parallel. For example, if you store the alphabet in table storage, fetching a and z, which are far apart, is best done with parallel retrieve operations, while fetching a, b, and c, which are close together, is best done with a query. Fetching a, b, c, and z would benefit from a hybrid approach.

    If you know all this up front, you can compute the best thing to do for a given set of PKs and RKs. The more you know about how the underlying data is sorted, the better your results will be. I'd advise against a one-size-fits-all approach here; instead, try to apply what you learn from these different query patterns to solve your problem.

  • 2021-01-11 12:24

    Your best bet is to create a LINQ/OData select query... that will fetch what you're looking for.

    For better performance you should make one query per partition and run those queries simultaneously.

    I haven't tested this personally, but think it would work.
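    A sketch of that idea, assuming the classic async table API and a list of requested (PartitionKey, RowKey) pairs named lookups; continuation-token handling is omitted for brevity:

    // One query per partition, all issued up front and awaited together.
    // Each query fetches its whole partition; you would either OR the row keys
    // into the filter (as in an earlier answer) or filter client-side afterwards.
    var queryTasks = lookups
        .GroupBy(x => x.PartitionKey)
        .Select(g => table.ExecuteQuerySegmentedAsync(
            new TableQuery<DynamicTableEntity>().Where(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, g.Key)),
            null)) // null continuation token; a full implementation loops until it comes back null
        .ToList();
    
    TableQuerySegment<DynamicTableEntity>[] segments = await Task.WhenAll(queryTasks);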
