Apache Arrow, alignment and padding

Submitted by 不羁岁月 on 2020-01-06 03:15:06

Question


I want to use Apache Arrow because it enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing (https://arrow.apache.org/).

From the documentation (https://arrow.apache.org/docs/memory_layout.html), I understand that Arrow's memory allocation guarantees 64-byte alignment.

To verify this 64-byte alignment, I read the data address from the __array_interface__ attribute of a NumPy array, which points to the data area storing the array contents, and compute that address modulo 64. If the result is 0, the memory address is aligned to at least 64 bytes.

When I execute the code below on my system (Fedora), it seems to work: the result of modulo 64 is zero. When I execute the same code on my colleague's system (also Fedora), it does not: the result of modulo 64 is not zero, so the memory is not aligned to 64 bytes.

Please find my code here:

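import pyarrow as pa

# Build an Arrow array of lists (a nested type)
tab = pa.array([[1, 2], [3, 4]])

# Convert it and read the address of the resulting data area
panda_array = tab.to_pandas()
address = panda_array.__array_interface__['data'][0]

# A remainder of 0 means the address lies on a 64-byte boundary
print('numpy address {} modulo 64 => {}'.format(address, address % 64))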

Thank you for your help.


Answer 1:


The memory in Arrow is 64-byte aligned, but in your example code the conversion to Pandas/NumPy makes a copy of the data, because an array of lists is represented differently in Arrow and in NumPy. Arrow uses one buffer that holds the values of all lists and another buffer that holds the offsets marking where each list starts within it. NumPy has no native list type, so the result is a NumPy array that contains other NumPy arrays as elements. The outer array stores these elements as Python objects.

So when you inspect the address through NumPy, you see memory allocated by NumPy, not by Arrow. If that address happens to lie on a 64-byte boundary, it is only by chance.

In the next version (0.9) of pyarrow there will be a buffers property to access the underlying memory addresses. You should then be able to check directly whether the Arrow memory is allocated at a 64-byte-aligned address (it always should be).



Source: https://stackoverflow.com/questions/48830024/apache-arrow-alignment-and-padding
