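The previous source-code analysis looked at the swift-ring-builder script. Its most complex and most important operation is the rebalance method. It regenerates the ring file after you modify the builder file (for example, by adding or removing devices) so that partitions are again evenly distributed across the system. (After a rebalance, the system's services need to be restarted.) Consistent hashing, replicas, zones and weights are all realized through it.
Source excerpt:
The rebalance method of swift-ring-builder.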
def rebalance():
    """
    swift-ring-builder <builder_file> rebalance
    Attempts to rebalance the ring by reassigning partitions that haven't been
    recently reassigned.
    """
    # devs_changed records whether the builder's devices have changed. It
    # defaults to False and is set to True by add_dev, set_dev_weight and
    # remove_dev.
    devs_changed = builder.devs_changed
    try:
        # Balance of the current ring, e.g. 0.83 (percent).
        last_balance = builder.get_balance()
        # The main rebalance call: returns the number of reassigned
        # partitions and the new balance.
        parts, balance = builder.rebalance()
    except exceptions.RingBuilderError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    if not parts:
        print 'No partitions could be reassigned.'
        print 'Either none need to be or none can be due to ' \
            'min_part_hours [%s].' % builder.min_part_hours
        exit(EXIT_WARNING)
    if not devs_changed and abs(last_balance - balance) < 1:
        print 'Cowardly refusing to save rebalance as it did not change ' \
            'at least 1%.'
        exit(EXIT_WARNING)
    try:
        # Safety check that catches bugs: makes sure partitions are assigned
        # to real devices, are not doubly assigned, and so on.
        builder.validate()
    except exceptions.RingValidationError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    # Print the result of the rebalance.
    print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
        (parts, 100.0 * parts / builder.parts, balance)
    status = EXIT_SUCCESS
    # A balance above 5 prints a note: push this ring, wait at least
    # min_part_hours, then rebalance and push again.
    if balance > 5:
        print '-' * 79
        print 'NOTE: Balance of %.02f indicates you should push this ' % \
            balance
        print '      ring, wait at least %d hours, and rebalance/repush.' \
            % builder.min_part_hours
        print '-' * 79
        status = EXIT_WARNING
    ts = time()  # current timestamp, used to name the backup copies
    # Save timestamped backups of the new ring and builder files ...
    builder.get_ring().save(
        pathjoin(backup_dir, '%d.' % ts + basename(ring_file)))
    pickle.dump(builder.to_dict(), open(pathjoin(backup_dir,
        '%d.' % ts + basename(argv[1])), 'wb'), protocol=2)
    # ... then overwrite the live ring and builder files.
    builder.get_ring().save(ring_file)
    pickle.dump(builder.to_dict(), open(argv[1], 'wb'), protocol=2)
    exit(status)
I have added some comments of my own to make the code easier to follow. It actually calls the rebalance method in builder.py.
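To make the balance value returned by get_balance() more concrete, here is a minimal Python 3 sketch of one way such a figure can be computed: the worst percentage deviation of any device from the share of partitions its weight entitles it to. The function names, the dict layout and the handling of zero-weight devices are assumptions made for illustration; this is not the code from builder.py.

```python
def weight_of_one_part(devs, parts, replicas):
    """Partition-replicas that one unit of weight should hold."""
    total_weight = sum(d['weight'] for d in devs)
    return parts * replicas / total_weight


def get_balance(devs, parts, replicas):
    """Worst percentage deviation of a device from its desired share."""
    per_weight = weight_of_one_part(devs, parts, replicas)
    balance = 0.0
    for dev in devs:
        if not dev['weight']:
            continue  # simplification: zero-weight devices are ignored
        desired = dev['weight'] * per_weight
        dev_balance = abs(100.0 * dev['parts'] / desired - 100.0)
        balance = max(balance, dev_balance)
    return balance


# Hypothetical ring: 100 partitions, 3 replicas, three equal-weight devices.
devs = [
    {'id': 0, 'weight': 1.0, 'parts': 103},
    {'id': 1, 'weight': 1.0, 'parts': 99},
    {'id': 2, 'weight': 1.0, 'parts': 98},
]
balance = get_balance(devs, parts=100, replicas=3)
print('Balance is now %.02f.' % balance)
```

Each device here should hold 100 partition-replicas, so the device holding 103 sets the balance. Under this definition, the "changed by less than 1" test in the command-line code compares two such worst-case percentages.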
The rebalance method in builder.py:
def rebalance(self):
    """
    Rebalance the ring.
    This is the main work function of the builder, as it will assign and
    reassign partitions to devices in the ring based on weights, distinct
    zones, recent reassignments, etc.
    The process doesn't always perfectly assign partitions (that'd take a
    lot more analysis and therefore a lot more time -- I had code that did
    that before). Because of this, it keeps rebalancing until the device
    skew (number of partitions a device wants compared to what it has) gets
    below 1% or doesn't change by more than 1% (only happens with ring that
    can't be balanced no matter what -- like with 3 zones of differing
    weights with replicas set to 3).
    :returns: (number_of_partitions_altered, resulting_balance)
    """
    self._ring = None  # discard the cached ring on this instance
    if self._last_part_moves_epoch is None:
        # First rebalance: do the initial assignment with its extra setup.
        self._initial_balance()
        self.devs_changed = False
        return self.parts, self.get_balance()
    retval = 0
    self._update_last_part_moves()  # update the time since each part moved
    last_balance = 0
    while True:
        # List of (part, replicas) pairs that need to be reassigned.
        reassign_parts = self._gather_reassign_parts()
        self._reassign_parts(reassign_parts)  # the actual reassignment
        retval += len(reassign_parts)
        while self._remove_devs:
            # Drop the devices that were scheduled for removal.
            self.devs[self._remove_devs.pop()['id']] = None
        balance = self.get_balance()  # the new balance
        if balance < 1 or abs(last_balance - balance) < 1 or \
                retval == self.parts:
            break
        last_balance = balance
    self.devs_changed = False
    self.version += 1
    return retval, balance
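The program picks its path based on whether _last_part_moves_epoch is None. If it is None (meaning this is the first rebalance), it calls _initial_balance() and returns the result. That path does largely the same work as the one taken when _last_part_moves_epoch is not None, except that _initial_balance also performs some initialization. The method that actually carries out the rebalance is _reassign_parts.
The _reassign_parts method in builder.py, which performs the actual partition assignment: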
def _reassign_parts(self, reassign_parts):
    """
    For an existing ring data set, partitions are reassigned similarly to
    the initial assignment. The devices are ordered by how many partitions
    they still want and kept in that order throughout the process. The
    gathered partitions are iterated through, assigning them to devices
    according to the "most wanted" while keeping the replicas as "far
    apart" as possible. Two different zones are considered the
    farthest-apart things, followed by different ip/port pairs within a
    zone; the least-far-apart things are different devices with the same
    ip/port pair in the same zone.
    If you want more replicas than devices, you won't get all your
    replicas.
    :param reassign_parts: An iterable of (part, replicas_to_replace)
                           pairs. replicas_to_replace is an iterable of the
                           replica (an int) to replace for that partition.
                           replicas_to_replace may be shared for multiple
                           partitions, so be sure you do not modify it.
    """
    for dev in self._iter_devs():
        dev['sort_key'] = self._sort_key_for(dev)  # set each dev's sort_key
    # Collect the usable devices (non-zero weight), sorted by sort_key.
    available_devs = \
        sorted((d for d in self._iter_devs() if d['weight']),
               key=lambda x: x['sort_key'])
    tier2children = build_tier_tree(available_devs)  # tier -> child tiers
    tier2devs = defaultdict(list)  # tier -> devices in that tier
    tier2sort_key = defaultdict(list)  # tier -> sort keys of those devices
    tiers_by_depth = defaultdict(set)  # depth -> tiers at that depth
    for dev in available_devs:  # index each device under all of its tiers
        for tier in tiers_for_dev(dev):
            tier2devs[tier].append(dev)  # <-- starts out sorted!
            tier2sort_key[tier].append(dev['sort_key'])
            tiers_by_depth[len(tier)].add(tier)
    for part, replace_replicas in reassign_parts:
        # Gather up what other tiers (zones, ip_ports, and devices) the
        # replicas not-to-be-moved are in for this part.
        # Replica count per tier, keyed by zone, ip_port and device id.
        other_replicas = defaultdict(lambda: 0)
        for replica in xrange(self.replicas):
            if replica not in replace_replicas:
                dev = self.devs[self._replica2part2dev[replica][part]]
                for tier in tiers_for_dev(dev):
                    other_replicas[tier] += 1  # replicas that stay put
        def find_home_for_replica(tier=(), depth=1):
            # Order the tiers by how many replicas of this
            # partition they already have. Then, of the ones
            # with the smallest number of replicas, pick the
            # tier with the hungriest drive and then continue
            # searching in that subtree.
            #
            # There are other strategies we could use here,
            # such as hungriest-tier (i.e. biggest
            # sum-of-parts-wanted) or picking one at random.
            # However, hungriest-drive is what was used here
            # before, and it worked pretty well in practice.
            #
            # Note that this allocator will balance things as
            # evenly as possible at each level of the device
            # layout. If your layout is extremely unbalanced,
            # this may produce poor results.
            # Descend one tier at a time, fewest replicas first.
            candidate_tiers = tier2children[tier]
            min_count = min(other_replicas[t] for t in candidate_tiers)
            candidate_tiers = [t for t in candidate_tiers
                               if other_replicas[t] == min_count]
            candidate_tiers.sort(
                key=lambda t: tier2sort_key[t][-1])
            if depth == max(tiers_by_depth.keys()):
                return tier2devs[candidate_tiers[-1]][-1]
            return find_home_for_replica(tier=candidate_tiers[-1],
                                         depth=depth + 1)
        for replica in replace_replicas:  # place each replica that must move
            dev = find_home_for_replica()
            dev['parts_wanted'] -= 1
            dev['parts'] += 1
            old_sort_key = dev['sort_key']
            new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
            for tier in tiers_for_dev(dev):
                other_replicas[tier] += 1
                index = bisect.bisect_left(tier2sort_key[tier],
                                           old_sort_key)
                tier2devs[tier].pop(index)
                tier2sort_key[tier].pop(index)
                new_index = bisect.bisect_left(tier2sort_key[tier],
                                               new_sort_key)
                tier2devs[tier].insert(new_index, dev)
                tier2sort_key[tier].insert(new_index, new_sort_key)
            # This replica of this part now lives on dev['id'].
            self._replica2part2dev[replica][part] = dev['id']
    # Just to save memory and keep from accidental reuse.
    for dev in self._iter_devs():
        del dev['sort_key']
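This function implements the reassignment. The key concept is the three-tier structure, which comes from utils.py: given a single dev it returns that device's tiers, and given a list of devs it returns a dictionary describing the tier tree.
The source gives an example: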
Example:
zone 1 -+---- 192.168.1.1:6000 -+---- device id 0
        |                       |
        |                       +---- device id 1
        |                       |
        |                       +---- device id 2
        |
        +---- 192.168.1.2:6000 -+---- device id 3
                                |
                                +---- device id 4
                                |
                                +---- device id 5
zone 2 -+---- 192.168.2.1:6000 -+---- device id 6
        |                       |
        |                       +---- device id 7
        |                       |
        |                       +---- device id 8
        |
        +---- 192.168.2.2:6000 -+---- device id 9
                                |
                                +---- device id 10
                                |
                                +---- device id 11
The tier tree would look like:
{
  (): [(1,), (2,)],
  (1,): [(1, 192.168.1.1:6000),
         (1, 192.168.1.2:6000)],
  (2,): [(2, 192.168.2.1:6000),
         (2, 192.168.2.2:6000)],
  (1, 192.168.1.1:6000): [(1, 192.168.1.1:6000, 0),
                          (1, 192.168.1.1:6000, 1),
                          (1, 192.168.1.1:6000, 2)],
  (1, 192.168.1.2:6000): [(1, 192.168.1.2:6000, 3),
                          (1, 192.168.1.2:6000, 4),
                          (1, 192.168.1.2:6000, 5)],
  (2, 192.168.2.1:6000): [(2, 192.168.2.1:6000, 6),
                          (2, 192.168.2.1:6000, 7),
                          (2, 192.168.2.1:6000, 8)],
  (2, 192.168.2.2:6000): [(2, 192.168.2.2:6000, 9),
                          (2, 192.168.2.2:6000, 10),
                          (2, 192.168.2.2:6000, 11)],
}
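Devices are divided into three tiers by zone, ip_port and device_id, and the later steps operate tier by tier (this is how concepts such as zones and replicas are realized).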
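The three-tier idea can be tried out in isolation. The following is a minimal Python 3 sketch, not the code from utils.py or builder.py: tiers_for_dev and build_tier_tree are simplified re-implementations under the same names, while find_home, the hunger dict and the use of parts_wanted as the "hungriest drive" measure are assumptions made for illustration (the real code orders devices by sort_key).

```python
from collections import defaultdict


def tiers_for_dev(dev):
    """Return the (zone,), (zone, ip:port), (zone, ip:port, id) tiers."""
    zone = dev['zone']
    ip_port = '%s:%s' % (dev['ip'], dev['port'])
    return ((zone,), (zone, ip_port), (zone, ip_port, dev['id']))


def build_tier_tree(devs):
    """Map each tier to the set of its child tiers; () is the root."""
    tier2children = defaultdict(set)
    for dev in devs:
        for tier in tiers_for_dev(dev):
            tier2children[tier[:-1]].add(tier)
    return tier2children


def find_home(devs, replicas_in_tier):
    """Walk down from the root: fewest replicas first, then hungriest."""
    tier2children = build_tier_tree(devs)
    # Hunger of a tier = parts_wanted of its hungriest device (assumption).
    hunger = {}
    for dev in devs:
        for tier in tiers_for_dev(dev):
            wanted = dev['parts_wanted']
            hunger[tier] = max(hunger.get(tier, wanted), wanted)
    tier = ()
    while tier in tier2children:  # device tiers have no children
        children = sorted(tier2children[tier])
        fewest = min(replicas_in_tier.get(t, 0) for t in children)
        candidates = [t for t in children
                      if replicas_in_tier.get(t, 0) == fewest]
        tier = max(candidates, key=lambda t: hunger[t])
    return tier


# Hypothetical devices: zone 1 is hungrier, but already holds a replica.
devs = [
    {'id': 0, 'zone': 1, 'ip': '10.0.0.1', 'port': 6000, 'parts_wanted': 5},
    {'id': 1, 'zone': 1, 'ip': '10.0.0.2', 'port': 6000, 'parts_wanted': 9},
    {'id': 2, 'zone': 2, 'ip': '10.0.1.1', 'port': 6000, 'parts_wanted': 2},
    {'id': 3, 'zone': 2, 'ip': '10.0.1.1', 'port': 6000, 'parts_wanted': 4},
]
tree = build_tier_tree(devs)
replicas_in_tier = {tier: 1 for tier in tiers_for_dev(devs[1])}
home = find_home(devs, replicas_in_tier)
print(home)
```

In this example the replica lands on device 3 in zone 2, even though device 1 in zone 1 wants more partitions. Spreading replicas across zones takes priority over filling the hungriest device, which is the behavior the docstring of _reassign_parts describes.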
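That completes a ring rebalance. At the end, the new builder file and ring file are saved. The ring file is generated from the new builder and written by a method of the RingData class; this part is fairly simple, so it is not analyzed here.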
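The save step in the command-line code writes a timestamped backup before overwriting the live builder file. Below is a minimal Python 3 sketch of that pattern using only the standard library. save_builder, the state dict and the "backups" directory name are hypothetical, and the sketch covers only the pickled builder file, not the ring file written through RingData.

```python
import os
import pickle
import tempfile
import time


def save_builder(state, builder_path, backup_dir):
    """Pickle state to a timestamped backup, then to the live file."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = int(time.time())
    backup_name = '%d.%s' % (stamp, os.path.basename(builder_path))
    backup_path = os.path.join(backup_dir, backup_name)
    for path in (backup_path, builder_path):
        with open(path, 'wb') as f:
            # protocol=2 matches the pickle.dump calls in the listing.
            pickle.dump(state, f, protocol=2)
    return backup_path


# Hypothetical builder state saved into a throwaway directory.
workdir = tempfile.mkdtemp()
builder_path = os.path.join(workdir, 'object.builder')
state = {'parts': 1024, 'replicas': 3, 'version': 1}
backup_path = save_builder(state, builder_path,
                           os.path.join(workdir, 'backups'))
with open(builder_path, 'rb') as f:
    restored = pickle.load(f)
print(restored == state)
```

Keeping the timestamp as a prefix of the original file name means every rebalance leaves its own backup, so an earlier builder state can be restored by copying a backup over the live file.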
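This roughly covers swift-ring-builder and the files under /swift/common/ring/. For what specific functions do and how they are implemented, see the source code. In the next article I will analyze swift-init, using the start method to walk through how the services are started.
Source: oschina
Link: https://my.oschina.net/u/243681/blog/80931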