Overview
Python's itertools module provides many memory-efficient iterators, and is especially useful in scenarios where the data is too large to fit in memory (out-of-memory errors).
Most of the loops we write day to day look like this:
# while loop: sum 1 + 2 + ... + 100
s, i = 0, 1
while i <= 100:
    s += i
    i += 1
print('while-loop: the sum of 1+2+..100 is:', s)

# for loop
s = 0
for i in range(101):
    s += i
print('for-loop: the sum of 1+2+..100 is:', s)
while-loop: the sum of 1+2+..100 is: 5050
for-loop: the sum of 1+2+..100 is: 5050
But with very large data this approach falls over, which is where itertools iterators come in: they follow a lazy-loading style of evaluation.
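To make the lazy idea concrete, here is a small, hypothetical comparison between materializing a full list and summing a lazy iterator built from itertools.count plus islice (the numbers are only for demonstration):

import itertools

# Eager: builds a one-million-element list in memory before summing
eager_total = sum(list(range(1, 1_000_001)))

# Lazy: count() yields one number at a time, islice() bounds it to the first million
lazy_total = sum(itertools.islice(itertools.count(1), 1_000_000))

print(eager_total == lazy_total)  # True, but the lazy version never builds the list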
Common APIs
- chain()
- groupby()
- accumulate()
- compress()
- takewhile()
- islice()
- repeat()
chain: joining elements
- Chains a group of iterables together to form one larger iterator:
# join / split
s = "If you please draw me a sheep?"
s1 = s.split()
s2 = "-".join(s1)
print("split->:", s1)
print("join->:", s2)
split->: ['If', 'you', 'please', 'draw', 'me', 'a', 'sheep?']
join->: If-you-please-draw-me-a-sheep?
import itertools
# chain
s = itertools.chain(['if', 'you'], ['please draw', 'me', 'a'], 'shape')
s
<itertools.chain at 0x1d883602240>
list(s)
['if', 'you', 'please draw', 'me', 'a', 's', 'h', 'a', 'p', 'e']
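A closely related helper is chain.from_iterable, which takes a single iterable of iterables instead of separate arguments; a small example:

# chain.from_iterable flattens one level of nesting, lazily
nested = [['if', 'you'], ['please draw', 'me', 'a'], 'shape']
list(itertools.chain.from_iterable(nested))
# ['if', 'you', 'please draw', 'me', 'a', 's', 'h', 'a', 'p', 'e']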
It is easy to see that this is just an iterator, much like join. So how does it save memory? It simply applies the iterator idea: one element is read into memory at a time, which keeps memory usage low.
# Pure-Python sketch of chain
def chain(*iterables):
    for iter_ in iterables:
        for elem in iter_:
            yield elem
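Assuming the sketch above has been defined, it behaves just like the real thing:

list(chain(['if', 'you'], ['please draw', 'me', 'a'], 'shape'))
# ['if', 'you', 'please draw', 'me', 'a', 's', 'h', 'a', 'p', 'e']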
groupby: grouping adjacent elements
- Picks out adjacent repeated elements from an iterator and groups them together.
# As long as the key function returns equal values for two elements, they are considered
# part of the same group, and that return value becomes the group's key
for key, group in itertools.groupby('AAABBBCCAAAdde'):
    print(key, list(group))
A ['A', 'A', 'A']
B ['B', 'B', 'B']
C ['C', 'C']
A ['A', 'A', 'A']
d ['d', 'd']
e ['e']
# Group case-insensitively
for key, group in itertools.groupby('AaaBBbcCAAa', lambda c: c.upper()):
    print(key, list(group))
A ['A', 'a', 'a']
B ['B', 'B', 'b']
C ['c', 'C']
A ['A', 'A', 'a']
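Because groupby only merges adjacent elements, you normally sort by the same key first if you want a single group per key; a small sketch:

words = ['apple', 'ant', 'bee', 'bat', 'ark']
# Sort by first letter so equal keys sit next to each other, then group
words.sort(key=lambda w: w[0])
for key, group in itertools.groupby(words, key=lambda w: w[0]):
    print(key, list(group))
# a ['apple', 'ant', 'ark']
# b ['bee', 'bat']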
accumulate: cumulative totals
list(itertools.accumulate([1,2,3,4,5], lambda x,y: x*y))
[1, 2, 6, 24, 120]
# Pseudocode sketch of accumulate
import operator

def accumulate(iterable, func=operator.add, *, initial=None):
    iter_ = iter(iterable)
    ret = initial
    # If no initial value is given, start from the first element
    if initial is None:
        try:
            ret = next(iter_)
        except StopIteration:
            return
    yield ret
    # Walk the remaining elements, combining each with the running result
    for elem in iter_:
        ret = func(ret, elem)
        yield ret
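If no function is passed, accumulate defaults to addition and produces running totals; since Python 3.8 an initial value can also be supplied. A quick example:

print(list(itertools.accumulate([1, 2, 3, 4, 5])))         # [1, 3, 6, 10, 15]
print(list(itertools.accumulate([1, 2, 3], initial=100)))  # [100, 101, 103, 106]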
compress: filtering
list(itertools.compress('youge', [1,0,True,3]))
['y', 'u', 'g']
# Pseudocode sketch of compress
def compress(data, selectors):
    for d, s in zip(data, selectors):
        if s:
            yield d

# demo of the zip-and-test idea
for data, key in zip([1, 2], 'abcd'):
    print(data, key)
    if key:
        print(data)
1 a
1
2 b
2
# Pythonic version
def compress(data, selectors):
    return (d for d, s in zip(data, selectors) if s)

# test
ret = compress(['love', 'you', 'forever'], ['love', None, 'dd', 'forever'])
print(ret)
print(list(ret))
<generator object compress.<locals>.<genexpr> at 0x000001D8831498E0>
['love', 'forever']
Generators
- An object whose class implements the __iter__() and __next__() methods is an iterator; a generator is a special kind of iterator that Python builds for you
- In code, generators take two forms: generator expressions, or functions that contain the yield keyword (see the sketch below)
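To make the distinction concrete, here is a minimal sketch (with made-up names) of a class-based iterator next to an equivalent generator function:

class CountDown:
    # Class-based iterator: implements __iter__() and __next__()
    def __init__(self, start):
        self.n = start
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

def count_down(start):
    # Generator function: yield does the state bookkeeping for us
    while start > 0:
        yield start
        start -= 1

print(list(CountDown(3)))   # [3, 2, 1]
print(list(count_down(3)))  # [3, 2, 1]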
zip
- Pairs up elements by position and stops as soon as the shortest iterable is exhausted; often described as the "zipper" function. A quick demo follows.
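The example also shows itertools.zip_longest, which pads the shorter iterable instead of stopping:

print(list(zip([1, 2, 3], 'ab')))
# [(1, 'a'), (2, 'b')]
print(list(itertools.zip_longest([1, 2, 3], 'ab', fillvalue='-')))
# [(1, 'a'), (2, 'b'), (3, '-')]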
takewhile
- takewhile: iterates in order, yielding each element while the predicate holds; as soon as an element fails the condition, it stops
# takewhile
s1 = list(itertools.takewhile(lambda x: x <= 2, [0, 3, 2, 1, -1, 3, 0]))
print(s1)
s2 = list(itertools.takewhile(lambda x: x < 5, [1, 4, 6, 4, 1, 3]))
print(s2)

# filterfalse: keeps the elements for which the predicate is false
s3 = list(itertools.filterfalse(lambda x: x % 2 == 0, range(10)))
print(s3)
[0]
[1, 4]
[1, 3, 5, 7, 9]
# Pseudocode sketch of takewhile
def take_while(condition, iter_obj):
    for elem in iter_obj:
        if condition(elem):
            yield elem
        else:
            break
dropwhile: drops elements while the condition holds, then yields every remaining element; see the small example below.
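Since the cell above actually demonstrated filterfalse, here is a small dropwhile example for comparison:

# Drops elements while the predicate holds, then yields everything that remains
list(itertools.dropwhile(lambda x: x < 5, [1, 4, 6, 4, 1, 3]))
# [6, 4, 1, 3]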
islice: slicing
# Ordinary slicing has to read the whole sequence into memory first
# Note: it produces a (shallow) copy
l = [1, 2, 3, 4, 5]
print(l[::1])

# generator-style slicing
# start, stop and step only accept 0 or positive values, but you can write your own
list(itertools.islice(l, 0, 3, 1))

s = slice(3, 4, 5)  # the built-in slice takes at most 3 arguments
s.start
s.stop
[1, 2, 3, 4, 5]
[1, 2, 3]
3
4
import sys

# Pseudocode sketch of islice
def islice(iter_obj, *args):
    s = slice(*args)                 # built-in slice object
    start = s.start or 0
    stop = s.stop or sys.maxsize     # a very large constant
    step = s.step or 1
    # The indices we actually want to yield
    iter_ = iter(range(start, stop, step))
    try:
        next_i = next(iter_)
    except StopIteration:
        # Nothing to yield: consume the skipped prefix, then stop
        for i, elem in zip(range(start), iter_obj):
            pass
        return
    try:
        for i, elem in enumerate(iter_obj):
            if i == next_i:
                yield elem
                next_i = next(iter_)
    except StopIteration:
        pass
[1, 2, 3, 4, 5]
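islice is especially handy on iterators that have no length at all, such as generators or itertools.count; a small example:

# Take items 10..14 from an endless counter without materializing anything
list(itertools.islice(itertools.count(), 10, 15))
# [10, 11, 12, 13, 14]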
repeat
list(itertools.repeat(['youge'], 3))
[['youge'], ['youge'], ['youge']]
# Pseudocode sketch of repeat
def repeat(obj, times=None):
    if times is None:
        while True:      # yield forever
            yield obj
    else:
        for i in range(times):
            yield obj
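A common use of repeat is to supply a constant stream of arguments to map or zip; for example:

# Square 0..4 by pairing each number with a repeated exponent of 2
list(map(pow, range(5), itertools.repeat(2)))
# [0, 1, 4, 9, 16]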