Python: flatten nested lists with indices

被撕碎了的回忆 2021-02-13 16:01

Given a list of arbitrarily deep nested lists of arbitrary size, I would like a flat, depth-first iterator over all elements in the tree, but with path indices as well such th

3 answers

鱼传尺愫 (original poster)

2021-02-13 16:32

Starting with direct recursion and state variables with default values,

def flatten (l, i = 0, path = (), acc = []):
  if not l:
    return acc
  else:
    first, *rest = l
    if isinstance (first, list):
      return flatten (first, 0, path + (i,), acc) + flatten (rest, i + 1, path, [])
    else:
      return flatten (rest, i + 1, path, acc + [ (first, path + (i,)) ])

print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]
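
The question's sample input is cut off above, so L is never shown. The sketch below uses a list that is consistent with the printed paths; it is an inference from the output, not text from the original (any empty sublists would leave no trace in the output). The helper at is a hypothetical name, used only to show what a path means: following the indices from the root reaches the value.

```python
from functools import reduce

# Assumed sample input, reconstructed from the printed (value, path) pairs.
L = [[[1, 2, 3], [4, 5]], [6], [7, [8, 9]], 10]

def at(tree, path):
    """Follow a tuple of indices from the root list down to one element."""
    return reduce(lambda node, i: node[i], path, tree)

# The pairs printed by flatten (L) in the answer.
expected = [
    (1, (0, 0, 0)), (2, (0, 0, 1)), (3, (0, 0, 2)),
    (4, (0, 1, 0)), (5, (0, 1, 1)),
    (6, (1, 0)),
    (7, (2, 0)), (8, (2, 1, 0)), (9, (2, 1, 1)),
    (10, (3,)),
]

# Every path leads back to the value it is paired with.
for value, path in expected:
    assert at(L, path) == value
```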

The recursive flatten above shares the same weakness as yours: it is not safe for deeply nested lists. We can use continuation-passing style to make it tail recursive. The changes are the new cont parameter and the lambdas passed to the recursive calls.

def identity (x):
  return x

# tail-recursive, but still not stack-safe, yet
def flatten (l, i = 0, path = (), acc = [], cont = identity):
  if not l:
    return cont (acc)
  else:
    first, *rest = l
    if isinstance (first, list):
      return flatten (first, 0, path + (i,), acc, lambda left:
        flatten (rest, i + 1, path, [], lambda right:
          cont (left + right)))
    else:
      return flatten (rest, i + 1, path, acc + [ (first, path + (i,)) ], cont)


print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]
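
As a rough check of the "not stack-safe yet" comment: Python does not eliminate tail calls, so each nested level still consumes a stack frame. The sketch below copies the continuation-passing version under the assumed name flatten_cps and feeds it a list nested deeper than the interpreter's recursion limit; nest is a hypothetical helper.

```python
import sys

def identity(x):
    return x

# Copy of the continuation-passing flatten, renamed so this sketch runs alone.
# The default acc list is never mutated: "+" always builds a new list.
def flatten_cps(l, i=0, path=(), acc=[], cont=identity):
    if not l:
        return cont(acc)
    first, *rest = l
    if isinstance(first, list):
        return flatten_cps(first, 0, path + (i,), acc, lambda left:
            flatten_cps(rest, i + 1, path, [], lambda right:
                cont(left + right)))
    return flatten_cps(rest, i + 1, path, acc + [(first, path + (i,))], cont)

def nest(value, depth):
    """Wrap value in `depth` extra layers of single-element lists."""
    for _ in range(depth):
        value = [value]
    return value

# Nest twice as deep as the current recursion limit allows.
deep = nest(1, 2 * sys.getrecursionlimit())

try:
    flatten_cps(deep)
    overflowed = False
except RecursionError:
    overflowed = True

print(overflowed)  # True: the tail calls still grow the Python stack
```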

Finally, we replace the recursive calls with our own call mechanism. This effectively sequences the recursive calls, so the function now works for data of any size and any level of nesting. This technique is called a trampoline. The changes are the inner loop function, the call wrapper around each recursive step, and the final run ().

def identity (x):
  return x

def flatten (l):
  def loop (l, i = 0, path = (), acc = [], cont = identity):  
    if not l:
      return cont (acc)
    else:
      first, *rest = l
      if isinstance (first, list):
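        return call (loop, first, 0, path + (i,), acc, lambda left:
          call (loop, rest, i + 1, path, [], lambda right:
            # defer the outer continuation too, so a long chain of
            # continuations unwinds in run (), not on the Python stack
            call (cont, left + right)))
      else:
        return call (loop, rest, i + 1, path, acc + [ (first, path + (i,)) ], cont)

  # wrap the first step in call so that an empty input also reaches run ()
  return call (loop, l) .run ()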

class call:
  def __init__ (self, f, *xs):
    self.f = f
    self.xs = xs

  def run (self):
    acc = self
    while (isinstance (acc, call)):
      acc = acc.f (*acc.xs)
    return acc

print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]
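
To see the trampoline in isolation, here is a minimal sketch on a simpler function. It reuses the call class from the answer; sum_to is a hypothetical example, not part of the original. Each step returns a deferred call instead of recursing, and run () executes the steps one at a time in a loop.

```python
class call:
    """A deferred call: remember a function and its arguments, run them later."""
    def __init__(self, f, *xs):
        self.f = f
        self.xs = xs

    def run(self):
        acc = self
        # Keep bouncing until a step returns a plain value instead of a call.
        while isinstance(acc, call):
            acc = acc.f(*acc.xs)
        return acc

def sum_to(n, acc=0):
    """Sum 1..n; the recursive step is returned as data, not executed here."""
    if n == 0:
        return acc
    return call(sum_to, n - 1, acc + n)

# 100,000 steps would overflow the stack with direct recursion.
total = call(sum_to, 100_000).run()
print(total)  # 5000050000
```

The stack never grows past one sum_to frame at a time, because every frame returns before the next one starts.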

Why is the trampolined flatten better? Objectively speaking, it's a more complete program. Just because it appears more complex doesn't mean it is less efficient.

The code provided in the question fails when the input list is nested more than 996 levels deep (in Python 3.x):

depth = 1000
L = [1]
while (depth > 0):
  L = [L]
  depth = depth - 1

for x in flatten (L):
  print (x)

# Bug in the question's code:
# the first value in the tuple is not completely flattened
# ([[[[[1]]]]], (0, 0, 0, ... ))

Worse, when depth increases to around 2000, the code provided in the question fails with a runtime error, GeneratorExitException.

My program works for inputs of any size, nested to any depth, and always produces the correct output.

depth = 50000
L = [1]
while (depth > 0):
  L = [L]
  depth = depth - 1

print (flatten (L))
# [ (1, (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 49991 more...)) ]

print (flatten (range (50000)))
# [ (0, (0,))
# , (1, (1,))
# , (2, (2,))
# , ...
# , (49999, (49999,))
# ]

Who would have such a deep list anyway? One common case is the linked list, which creates deep, tree-like structures:

my_list = [ 1, [ 2, [ 3, [ 4, None ] ] ] ]

Such a structure is common because the outermost pair gives us easy access to the two semantic parts we care about: the first item, and the rest of the items. The linked list could be implemented using tuple or dict as well.

my_list = ( 1, ( 2, ( 3, ( 4, None ) ) ) )

my_list = { "first": 1
          , "rest": { "first": 2
                    , "rest": { "first": 3
                              , "rest": { "first": 4
                                        , "rest": None
                                        }
                              }
                    }
          }

Above, we can see that a sensible structure can create significant depth: one level of nesting per element. In Python, [], (), and {} can be nested to any depth. Why should our generic flatten restrict that freedom?
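
A small sketch of how quickly that depth grows, using the list-based pairs shown above. The helper names linked_list, walk, and depth are hypothetical. The nesting depth equals the number of elements, so even a modest linked list is deeper than Python's default recursion limit.

```python
def linked_list(values):
    """Build [first, rest] pairs, ending in None, from any iterable."""
    node = None
    for v in reversed(list(values)):
        node = [v, node]
    return node

def walk(node):
    """Yield the items of a linked list with a loop, not recursion."""
    while node is not None:
        first, rest = node
        yield first
        node = rest

def depth(node):
    """Count how many pairs are nested inside one another."""
    return sum(1 for _ in walk(node))

my_list = linked_list([1, 2, 3, 4])
print(my_list)                             # [1, [2, [3, [4, None]]]]
print(depth(linked_list(range(10_000))))   # 10000
```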

It's my opinion that if we're going to design a generic function like flatten, we should choose the implementation that works in the most cases and has the fewest surprises. One that suddenly fails just because a certain (deep) structure is used is bad. The flatten used in my answer is not the fastest [1], but it doesn't surprise the programmer with strange answers or program crashes.

[1] I don't measure performance until it matters, so I haven't done anything to tune flatten above. Another understated advantage of my program is that you can tune it, because we wrote it. On the other hand, if for, enumerate and yield caused problems in your program, what would you do to "fix" it? How would you make it faster? How would you make it work for inputs of greater size or depth? What good is a Ferrari after it's wrapped around a tree?
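
One possible direction for such tuning, offered as a sketch and not as the answer's method: swap the trampoline for an explicit stack of iterators. The name flatten_iter is hypothetical. It keeps a single mutable path and only builds a tuple when a leaf is reached, so memory stays proportional to the nesting depth.

```python
def flatten_iter(tree):
    """Yield (value, path) pairs depth-first, with no recursion at all."""
    stack = [enumerate(tree)]  # one enumerate iterator per list being walked
    path = []                  # indices leading to the list on top of the stack
    while stack:
        for i, item in stack[-1]:
            if isinstance(item, list):
                # Descend: remember where we are and start on the sublist.
                stack.append(enumerate(item))
                path.append(i)
                break
            yield item, (*path, i)
        else:
            # The top list is exhausted: go back up one level.
            stack.pop()
            if path:
                path.pop()

# Assumed sample input, consistent with the paths printed in the answer.
L = [[[1, 2, 3], [4, 5]], [6], [7, [8, 9]], 10]
for pair in flatten_iter(L):
    print(pair)
```

Because each paused enumerate iterator remembers its position, returning to a parent list resumes exactly where the descent began.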
