Correct use of a fold or reduce function to long-to-wide data in python or javascript?

Posted by 为君一笑 on 2019-12-10 22:31:45

Question


I'm trying to learn to think a little more like a functional programmer. I'd like to transform a data set with what I think is either a fold or a reduce operation. In R, I would think of this as a reshape operation, but I'm not sure how to translate that thinking.

My data is a JSON string that looks like this:

s = '''[
{"query":"Q1", "detail" : "cool", "rank":1,"url":"awesome1"},
{"query":"Q1", "detail" : "cool", "rank":2,"url":"awesome2"},
{"query":"Q1", "detail" : "cool", "rank":3,"url":"awesome3"},
{"query":"Q#2", "detail" : "same", "rank":1,"url":"newurl1"},
{"query":"Q#2", "detail" : "same", "rank":2,"url":"newurl2"},
{"query":"Q#2", "detail" : "same", "rank":3,"url":"newurl3"}
]'''

I'd like to turn it into something like this, where query is the master key defining the 'row', nesting the unique "rows" corresponding to the "rank" values and "url" fields:

'[
{ "query" : "Q1",
    "results" : [
        {"rank" : 1, "url": "awesome1"},
        {"rank" : 2, "url": "awesome2"},
        {"rank" : 3, "url": "awesome3"}        
    ]},
{ "query" : "Q#2",
    "results" : [
        {"rank" : 1, "url": "newurl1"},
        {"rank" : 2, "url": "newurl2"},
        {"rank" : 3, "url": "newurl3"}
    ]}
]'

I know I can iterate through, but I suspect there is a functional operation that does this transformation, right?

Would also be curious to know how to get to something more like this, Version2:

'[
{ "query" : "Q1",
    "Common to all results" : [
        {"detail" : "cool"}
    ],
    "results" : [
        {"rank" : 1, "url": "awesome1"},
        {"rank" : 2, "url": "awesome2"},
        {"rank" : 3, "url": "awesome3"}        
    ]},
{ "query" : "Q#2",
    "Common to all results" : [
        {"detail" : "same"}
    ],
    "results" : [
        {"rank" : 1, "url": "newurl1"},
        {"rank" : 2, "url": "newurl2"},
        {"rank" : 3, "url": "newurl3"}        
    ]}
]'

In this second version, I'd like to take all data repeating under the same query, and shove it into an "other stuff" container, where all the items unique under "rank" would be in the "results" container.

I'm working on JSON objects in MongoDB, and can use either Python or JavaScript to try out this transform.

Any advice, such as the proper name for this transformation or what might be the fastest way to do this on a large data set, is appreciated!

EDIT

Incorporating @abarnert's excellent solution below, trying to get my Version2 above for anyone else working on the same kind of problem, requiring bifurcating some keys under one level, other keys under another...

Here's what I tried:

import itertools
import operator
from functools import partial
groups = itertools.groupby(initial, operator.itemgetter('query'))
def filterkeys(d,mylist):
    return {k: v for k, v in d.items() if k in mylist}

results = ((key, map(partial(filterkeys, mylist=['rank','url']),group)) for key, group in groups)
other_stuff = ((key, map(partial(filterkeys, mylist=['detail']),group)) for key, group in groups)

???

Oh no!


Answer 1:


I know this isn't the fold-style solution you were asking for, but I would do this with itertools, which is just as functional (unless you think Haskell is less functional than Lisp…), and also probably the most Pythonic way to solve this.

The idea is to think of your sequence as a lazy list, and apply a chain of lazy transformations to it until you get the list you want.
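As a small illustration of that lazy style (a sketch with made-up numbers, not the data from the question), nothing in the chain below is computed until the final list call pulls values through it:

```python
import itertools

def squares(numbers):
    # Generator: yields one square at a time, only when asked.
    for n in numbers:
        yield n * n

# itertools.count() is infinite, so this chain only works because it is lazy.
evens = (n for n in itertools.count(1) if n % 2 == 0)
first_three = list(itertools.islice(squares(evens), 3))
print(first_three)  # [4, 16, 36]
```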

The key step here is groupby:

>>> import itertools, json, operator
>>> initial = json.loads(s)
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([(key, list(group)) for key, group in groups])
[('Q1',
  [{'detail': 'cool', 'query': 'Q1', 'rank': 1, 'url': 'awesome1'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 2, 'url': 'awesome2'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 3, 'url': 'awesome3'}]),
 ('Q#2',
  [{'detail': 'same', 'query': 'Q#2', 'rank': 1, 'url': 'newurl1'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 2, 'url': 'newurl2'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 3, 'url': 'newurl3'}])]

You can see how close we are already, in just one step.
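One caveat worth spelling out: groupby only merges consecutive items with equal keys, so this works because the sample rows are already ordered by query. If the rows might be interleaved, sorting on the same key first should fix that. A minimal sketch, where the interleaved rows list is a made-up example rather than the original data:

```python
import itertools
import operator

# Hypothetical input where rows for the same query are not adjacent.
rows = [
    {"query": "Q1", "rank": 1},
    {"query": "Q#2", "rank": 1},
    {"query": "Q1", "rank": 2},
]
by_query = operator.itemgetter("query")

# Without sorting, "Q1" shows up as two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(rows, by_query)]

# sorted() is stable, so rows keep their relative order within each query.
sorted_rows = sorted(rows, key=by_query)
sorted_keys = [key for key, _ in itertools.groupby(sorted_rows, by_query)]

print(unsorted_keys)  # ['Q1', 'Q#2', 'Q1']
print(sorted_keys)    # ['Q#2', 'Q1']
```

Note that sorting changes the order in which the groups come out, so this only fits if the order of the queries does not matter.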

To restructure each (key, group) pair into the dict format you want:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([{"query": key, "results": list(group)} for key, group in groups])
[{'query': 'Q1',
  'results': [{'detail': 'cool',
               'query': 'Q1',
               'rank': 1,
               'url': 'awesome1'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 2,
               'url': 'awesome2'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 3,
               'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'detail': 'same',
               'query': 'Q#2',
               'rank': 1,
               'url': 'newurl1'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 2,
               'url': 'newurl2'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 3,
               'url': 'newurl3'}]}]

But wait, there are still those extra fields you want to get rid of. Easy:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> def filterkeys(d):
...     return {k: v for k, v in d.items() if k in ('rank', 'url')}
>>> filtered = ((key, map(filterkeys, group)) for key, group in groups)
>>> print([{"query": key, "results": list(group)} for key, group in filtered])
[{'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

The only thing left to do is to call json.dumps instead of print.
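Put together as a script rather than a REPL session, the whole thing might look like the sketch below. The function name long_to_wide and its key and keep parameters are made up for illustration, and the sample string is a shortened copy of the data from the question:

```python
import itertools
import json
import operator

def long_to_wide(json_text, key="query", keep=("rank", "url")):
    """Group rows by `key` and keep only the `keep` fields in each result."""
    rows = json.loads(json_text)
    # Assumes rows with the same key are already adjacent, as in the question.
    groups = itertools.groupby(rows, operator.itemgetter(key))
    wide = [
        {key: value,
         "results": [{k: row[k] for k in keep if k in row} for row in group]}
        for value, group in groups
    ]
    return json.dumps(wide)

sample = '''[
{"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
{"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
{"query": "Q#2", "detail": "same", "rank": 1, "url": "newurl1"}
]'''
print(long_to_wide(sample))
```

Passing indent=2 to json.dumps gives output that is easier to read by eye.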


For your followup, you want to take all values that are identical across every row with the same query and group them into otherstuff, and then list whatever remains in the results.

So, for each group, first we want to get the common keys. We can do this by iterating the keys of any member of the group (anything that's not in the first member can't be in all members), so:

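def common_fields(group):
    # group must be a list (not a one-shot iterator), since it is indexed and sliced.
    def in_all_members(key, value):
        # "key in member" guards against members that lack the field entirely.
        return all(key in member and member[key] == value for member in group[1:])
    return {key: value for key, value in group[0].items() if in_all_members(key, value)}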

Or, alternatively… if we turn each member into a set of key-value pairs, instead of a dict, we can then just intersect them all. And this means we finally get to use reduce, so let's try that:

import functools

def common_fields(group):
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))

I think the conversion back and forth between dict and set may make this less readable, and it also means that your values have to be hashable (not a problem for your sample data, since the values are all strings and integers)… but it's certainly more concise.
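As a quick check, the two versions can be run side by side on one group. In this sketch they are renamed so both can coexist (common_fields_loop and common_fields_reduce are made-up names), and a hypothetical tags field holding a list shows the hashability limitation:

```python
import functools

def common_fields_loop(group):
    # Compare every field of the first member against the rest of the group.
    first, rest = group[0], group[1:]
    return {k: v for k, v in first.items()
            if all(k in m and m[k] == v for m in rest)}

def common_fields_reduce(group):
    # Intersect the (key, value) pairs of all members.
    return dict(functools.reduce(set.intersection,
                                 (set(d.items()) for d in group)))

group = [
    {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
    {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
]
print(common_fields_loop(group))  # {'query': 'Q1', 'detail': 'cool'}
print(common_fields_reduce(group) == common_fields_loop(group))  # True

# Unhashable values (here a list) break the set-based version only.
with_list = [dict(member, tags=["a"]) for member in group]
try:
    common_fields_reduce(with_list)
except TypeError as error:
    print("reduce version failed:", error)
print(common_fields_loop(with_list))
```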

This will, of course, always include query as a common field, but we'll deal with that later. (Also, you wanted otherstuff to be a list with one dict, so we'll throw an extra pair of brackets around it.)

Meanwhile, results is the same as above, except that filterkeys filters out all of the common fields, instead of filtering out everything but rank and url. Putting it together:

def process_group(group):
    # groupby yields one-shot iterators, so copy the group before iterating it twice.
    group = list(group)
    common = dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))
    def filterkeys(member):
        return {k: v for k, v in member.items() if k not in common}
    # Build results while 'query' is still in common, so it is filtered out of each row.
    results = list(map(filterkeys, group))
    query = common.pop('query')
    return {'query': query,
            'otherstuff': [common],
            'results': results}

So, now we just use that function:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([process_group(group) for key, group in groups])
[{'otherstuff': [{'detail': 'cool'}],
  'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'otherstuff': [{'detail': 'same'}],
  'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

This obviously isn't as trivial as the original version, but hopefully it all still makes sense. There are only two new tricks. First, we have to iterate over each group multiple times (once to find the common keys, and then again to extract the remaining keys), which is why process_group starts by copying the group into a list: groupby hands out one-shot iterators.
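To see why a plain second pass would not work, here is a small sketch (the two-row rows list is a made-up example): a group produced by groupby can only be traversed once, while a list copy can be traversed as often as needed.

```python
import itertools
import operator

rows = [
    {"query": "Q1", "rank": 1},
    {"query": "Q1", "rank": 2},
]
groups = itertools.groupby(rows, operator.itemgetter("query"))
key, group = next(groups)

first_pass = [row["rank"] for row in group]
second_pass = [row["rank"] for row in group]  # group is already exhausted
print(first_pass)   # [1, 2]
print(second_pass)  # []

# Copying into a list makes the group reusable.
groups = itertools.groupby(rows, operator.itemgetter("query"))
key, group = next(groups)
members = list(group)
ranks_once = [row["rank"] for row in members]
ranks_twice = [row["rank"] for row in members]
print(ranks_once == ranks_twice)  # True
```

The same applies to the groups iterator itself, which is likely what went wrong in the attempt in your edit: results and other_stuff both draw from the same groups object, so whichever is consumed first leaves nothing for the other.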



Source: https://stackoverflow.com/questions/16154847/correct-use-of-a-fold-or-reduce-function-to-long-to-wide-data-in-python-or-javas
