Group by multiple keys and summarize/average values of a list of dictionaries

梦谈多话 2020-11-30 01:03

What is the most Pythonic way to group by multiple keys and summarize/average the values of a list of dictionaries in Python? Say I have a list of dictionaries as below:
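
For readers following along, here is a hypothetical input that is consistent with the sums and averages quoted in the answers. The individual `qty` values and the `transId` values are assumptions; only the `dept`, `sku`, `qty` and `transId` field names appear in the answers.

```python
# Hypothetical sample input: the qty and transId values are assumed,
# chosen so that the per-group totals match the outputs in the answers.
input_data = [
    {'dept': '001', 'sku': 'foo', 'transId': 'uniqueId1', 'qty': 100},
    {'dept': '001', 'sku': 'bar', 'transId': 'uniqueId2', 'qty': 200},
    {'dept': '001', 'sku': 'foo', 'transId': 'uniqueId3', 'qty': 300},
    {'dept': '002', 'sku': 'baz', 'transId': 'uniqueId4', 'qty': 400},
    {'dept': '002', 'sku': 'baz', 'transId': 'uniqueId5', 'qty': 500},
    {'dept': '002', 'sku': 'qux', 'transId': 'uniqueId6', 'qty': 600},
    {'dept': '003', 'sku': 'foo', 'transId': 'uniqueId7', 'qty': 700},
]
```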

7 answers
  • 2020-11-30 01:16

    Inspired by Eelco Hoogendoorn's answer, here is another way to solve this using the pandas package. I find the code more readable.

    import pandas as pd

    def sum_by_sku_and_dept(data):
        df = pd.DataFrame(data)
        # Sum only the qty column within each (sku, dept) group
        totals = df.groupby(['sku', 'dept'])['qty'].sum()
        return [{'sku': sku, 'dept': dept, 'qty': qty}
                for (sku, dept), qty in totals.items()]
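
    The same pandas approach can produce the sum and the average in one pass. This is only a sketch: the helper name `summarize_by_dept_and_sku` and the sample rows are assumptions, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample rows (assumed values)
input_data = [
    {'dept': '001', 'sku': 'foo', 'qty': 100},
    {'dept': '001', 'sku': 'bar', 'qty': 200},
    {'dept': '001', 'sku': 'foo', 'qty': 300},
    {'dept': '002', 'sku': 'baz', 'qty': 400},
    {'dept': '002', 'sku': 'baz', 'qty': 500},
    {'dept': '002', 'sku': 'qux', 'qty': 600},
    {'dept': '003', 'sku': 'foo', 'qty': 700},
]

def summarize_by_dept_and_sku(data):
    df = pd.DataFrame(data)
    stats = (
        df.groupby(['dept', 'sku'])['qty']
          .agg(['sum', 'mean'])                           # one column per aggregate
          .rename(columns={'sum': 'qty', 'mean': 'avg'})  # match the key names used here
          .reset_index()                                  # turn the group keys back into columns
    )
    return stats.to_dict('records')                       # list of plain dicts, one per group

result = summarize_by_dept_and_sku(input_data)
```

    `groupby` sorts the group keys by default, so the records come back ordered by `dept` and then `sku`.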
    
  • 2020-11-30 01:17

    You can keep the summed quantities and the number of their occurrences in one big defaultdict, keyed by (dept, sku):

    from collections import defaultdict
    
    counts = defaultdict(lambda: [0, 0])
    for line in input_data:
        entry = counts[(line['dept'], line['sku'])]
        entry[0] += line['qty']
        entry[1] += 1
    

    Now it only remains to turn the numbers into lists of dicts:

    sums_dict = [{'dept': k[0], 'sku': k[1], 'qty': v[0]}
                 for k, v in counts.items()]
    avg_dict = [{'dept': k[0], 'sku': k[1], 'avg': float(v[0]) / v[1]}
                for k, v in counts.items()]
    

    The results for the sums:

    sums_dict
    
    [{'dept': '002', 'qty': 600, 'sku': 'qux'},
     {'dept': '001', 'qty': 400, 'sku': 'foo'},
     {'dept': '003', 'qty': 700, 'sku': 'foo'},
     {'dept': '002', 'qty': 900, 'sku': 'baz'},
     {'dept': '001', 'qty': 200, 'sku': 'bar'}]
    

    and for the averages:

    avg_dict
    
    [{'avg': 600.0, 'dept': '002', 'sku': 'qux'},
     {'avg': 200.0, 'dept': '001', 'sku': 'foo'},
     {'avg': 700.0, 'dept': '003', 'sku': 'foo'},
     {'avg': 450.0, 'dept': '002', 'sku': 'baz'},
     {'avg': 200.0, 'dept': '001', 'sku': 'bar'}]
    

    An alternative version without defaultdict, using dict.setdefault:

    counts = {}
    for line in input_data:
        entry = counts.setdefault((line['dept'], line['sku']), [0, 0])
        entry[0] += line['qty']
        entry[1] += 1
    

    The rest is the same:

    sums_dict = [{'dept': k[0], 'sku': k[1], 'qty': v[0]} 
                  for k, v in counts.items()]
    avg_dict = [{'dept': k[0], 'sku': k[1], 'avg': float(v[0]) / v[1]}
                for k, v in counts.items()]
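
    The two steps can also be folded into one reusable function. This is a minimal sketch: the function name `summarize`, its parameters and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def summarize(rows, keys=('dept', 'sku'), value='qty'):
    """Return one dict per key combination with the total and mean of `value`."""
    totals = defaultdict(lambda: [0, 0])  # key tuple -> [running sum, count]
    for row in rows:
        entry = totals[tuple(row[k] for k in keys)]
        entry[0] += row[value]
        entry[1] += 1
    result = []
    for key, (total, count) in totals.items():
        summary = dict(zip(keys, key))   # e.g. {'dept': '001', 'sku': 'foo'}
        summary[value] = total
        summary['avg'] = total / count   # true division on Python 3
        result.append(summary)
    return result

# Hypothetical sample rows (assumed values)
rows = [
    {'dept': '001', 'sku': 'foo', 'qty': 100},
    {'dept': '001', 'sku': 'bar', 'qty': 200},
    {'dept': '001', 'sku': 'foo', 'qty': 300},
    {'dept': '002', 'sku': 'baz', 'qty': 400},
    {'dept': '002', 'sku': 'baz', 'qty': 500},
    {'dept': '002', 'sku': 'qux', 'qty': 600},
    {'dept': '003', 'sku': 'foo', 'qty': 700},
]
summary = summarize(rows)
```

    On Python 3.7+ the groups appear in the order in which their keys were first seen, because dicts preserve insertion order.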
    
  • 2020-11-30 01:19

    @thefourtheye If we group by only one key, itemgetter returns a bare value instead of a tuple, so we should check the type of the key after grouping and wrap it in a list if it is not a tuple:

    for key, grp in groupby(sorted(input_data, key=grouper), grouper):
        if not isinstance(key, tuple):
            key = [key]
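
    A fuller sketch of that check, so the same loop works for one key or several. The helper name `group_sum` and the sample rows are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def group_sum(rows, keys, value='qty'):
    grouper = itemgetter(*keys)  # returns a bare value when len(keys) == 1
    result = []
    for key, grp in groupby(sorted(rows, key=grouper), grouper):
        if not isinstance(key, tuple):
            key = (key,)         # normalise the single-key case
        entry = dict(zip(keys, key))
        entry[value] = sum(item[value] for item in grp)
        result.append(entry)
    return result

# Hypothetical sample rows (assumed values)
rows = [
    {'dept': '001', 'sku': 'foo', 'qty': 100},
    {'dept': '001', 'sku': 'bar', 'qty': 200},
    {'dept': '001', 'sku': 'foo', 'qty': 300},
    {'dept': '002', 'sku': 'baz', 'qty': 400},
    {'dept': '002', 'sku': 'baz', 'qty': 500},
    {'dept': '002', 'sku': 'qux', 'qty': 600},
    {'dept': '003', 'sku': 'foo', 'qty': 700},
]
by_dept = group_sum(rows, ['dept'])
by_dept_and_sku = group_sum(rows, ['dept', 'sku'])
```

    Without the isinstance check, `zip(keys, key)` would pair `'dept'` with the first character of the string `'001'` rather than with the whole value.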
    
  • 2020-11-30 01:29

    To get the aggregated results, sort the input by the grouping keys and then group it with itertools.groupby:

    from itertools import groupby
    from operator import itemgetter
    
    grouper = itemgetter("dept", "sku")
    result = []
    for key, grp in groupby(sorted(input_data, key = grouper), grouper):
        temp_dict = dict(zip(["dept", "sku"], key))
        temp_dict["qty"] = sum(item["qty"] for item in grp)
        result.append(temp_dict)
    
    from pprint import pprint
    pprint(result)
    

    Output

    [{'dept': '001', 'qty': 200, 'sku': 'bar'},
     {'dept': '001', 'qty': 400, 'sku': 'foo'},
     {'dept': '002', 'qty': 900, 'sku': 'baz'},
     {'dept': '002', 'qty': 600, 'sku': 'qux'},
     {'dept': '003', 'qty': 700, 'sku': 'foo'}]
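
    The sort matters because itertools.groupby only merges consecutive items that share a key. A small sketch of the difference, using assumed sample rows and a hypothetical helper `sums`:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows in which the two 'foo' entries are not adjacent
rows = [
    {'dept': '001', 'sku': 'foo', 'qty': 100},
    {'dept': '001', 'sku': 'bar', 'qty': 200},
    {'dept': '001', 'sku': 'foo', 'qty': 300},
]
grouper = itemgetter('dept', 'sku')

def sums(rows_in_order):
    # One (key, total qty) pair per run of consecutive rows with the same key
    return [(key, sum(r['qty'] for r in grp))
            for key, grp in groupby(rows_in_order, grouper)]

unsorted_sums = sums(rows)                      # 'foo' shows up as two separate groups
sorted_sums = sums(sorted(rows, key=grouper))   # 'foo' rows are merged into one group
```

    Here `unsorted_sums` has three entries while `sorted_sums` has two, with the `('001', 'foo')` total at 400.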
    

    To get the averages, simply change the body of the for loop, like this:

    temp_dict = dict(zip(["dept", "sku"], key))
    temp_list = [item["qty"] for item in grp]
    temp_dict["avg"] = sum(temp_list) / len(temp_list)
    result.append(temp_dict)
    

    Output (on Python 3, `/` is true division, so these averages would print as floats such as 200.0)

    [{'avg': 200, 'dept': '001', 'sku': 'bar'},
     {'avg': 200, 'dept': '001', 'sku': 'foo'},
     {'avg': 450, 'dept': '002', 'sku': 'baz'},
     {'avg': 600, 'dept': '002', 'sku': 'qux'},
     {'avg': 700, 'dept': '003', 'sku': 'foo'}]
    

    Suggestion: I would put both qty and avg in the same dict, like this:

    temp_dict = dict(zip(["dept", "sku"], key))
    temp_list = [item["qty"] for item in grp]
    temp_dict["qty"] = sum(temp_list)
    temp_dict["avg"] = temp_dict["qty"] / len(temp_list)
    result.append(temp_dict)
    

    Output

    [{'avg': 200, 'dept': '001', 'qty': 200, 'sku': 'bar'},
     {'avg': 200, 'dept': '001', 'qty': 400, 'sku': 'foo'},
     {'avg': 450, 'dept': '002', 'qty': 900, 'sku': 'baz'},
     {'avg': 600, 'dept': '002', 'qty': 600, 'sku': 'qux'},
     {'avg': 700, 'dept': '003', 'qty': 700, 'sku': 'foo'}]
    
  • 2020-11-30 01:32

    As always, there are lots of valid solutions. I like the defaultdict one, since I find it easier to understand.

    from collections import defaultdict

    # Nested defaultdict: food[dept][sku] accumulates the total qty
    food = defaultdict(lambda: defaultdict(int))
    for dct in input_data:
        food[dct['dept']][dct['sku']] += dct['qty']
    # Flatten to (dept, sku, total qty) tuples
    output_tupl = [(dept, sku, qty) for dept, skus in food.items() for sku, qty in skus.items()]
    
  • 2020-11-30 01:35

    Using the numpy EP you can find here, you could write:

    # Turn the list of dicts into a dict of columns
    inputs = {k: [row[k] for row in input_data] for k in input_data[0]}
    # group_by comes from the numpy EP mentioned above
    print(group_by((inputs['dept'], inputs['sku'])).mean(inputs['qty']))
    

    However, you may want to consider using the pandas package if you are doing a lot of relational operations of this kind.
