apache-arrow

How to convert a PyArrow table to an Arrow table when interfacing between PyArrow in Python and Arrow in C++

Submitted by 旧时模样 on 2021-02-08 04:37:27
Question: I have a C++ library built against the Apache Arrow C++ libraries, with a Python binding via pybind11. I'd like to be able to write a C++ function that takes a table constructed with PyArrow, like:

    void test(arrow::Table test);

passing in a PyArrow table like:

    tab = pa.Table.from_pandas(df)
    mybinding.test(tab)

If I write a naive function as above, I get:

    TypeError: arrow_test(): incompatible function arguments. The following argument types are supported: 1. (arg0: arrow::Table) ->
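
One way to bridge this gap is to accept the PyArrow object as a generic py::object and unwrap it with the helpers in Arrow's Python integration header (arrow/python/pyarrow.h). The sketch below is untested and assumes pybind11 plus an Arrow build that ships the Python extensions; in recent Arrow versions, arrow::py::unwrap_table returns an arrow::Result.

```cpp
// A minimal sketch (untested): take the PyArrow table as a generic Python
// object and unwrap it into the C++ arrow::Table it holds.
#include <memory>

#include <arrow/api.h>
#include <arrow/python/pyarrow.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

void test(py::object py_table) {
  // Convert the PyArrow wrapper back into the underlying C++ table.
  std::shared_ptr<arrow::Table> table =
      arrow::py::unwrap_table(py_table.ptr()).ValueOrDie();
  // ... work with the C++ table here ...
}

PYBIND11_MODULE(mybinding, m) {
  arrow::py::import_pyarrow();  // must run before any unwrap/wrap helper
  m.def("test", &test);
}
```

On the Python side, mybinding.test(pa.Table.from_pandas(df)) should then work, since pybind11 no longer has to convert the argument to arrow::Table itself.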

How to solve pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing spark's memory?

Submitted by 半腔热情 on 2021-02-04 18:59:28
Question: I'm running a pyspark job that at one point uses a grouped aggregate Pandas UDF. This results in the following (abbreviated here) error:

    org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer

I'm fairly sure this is because one of the groups the Pandas UDF receives is huge; if I reduce the dataset and remove enough rows, I can run my UDF with no problems. However, I want to run with my original dataset, and even if I run this Spark job on a machine with
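
For reference, the standard memory knobs look like the sketch below (the sizes are placeholders, not recommendations). Note that more memory may not be enough here: the exception is raised by Arrow's Java vectors when a single buffer would exceed its maximum allocation size (about 2 GB by default), so one oversized group can still fail no matter how much memory the cluster has.

```python
# A hedged sketch: standard Spark memory settings with placeholder sizes.
# Arrow's buffers live off-heap in the JVM, so memoryOverhead matters too.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("grouped-pandas-udf-job")
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    # Headroom for off-heap allocations such as Arrow's direct buffers.
    .config("spark.executor.memoryOverhead", "8g")
    .getOrCreate()
)
```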

How to get the arrow package for R with lz4 support?

Submitted by 一个人想着一个人 on 2021-01-28 22:20:38
Question: The R package arrow installed with install.packages('arrow') does not have lz4 support:

    codec_is_available('lz4')
    # [1] FALSE

The package version is:

    packageVersion('arrow')
    # [1] ‘0.17.1’

This is on Ubuntu 20.04. How can I get an R arrow package with lz4 support?

Answer 1: According to the docs, you can set export LIBARROW_MINIMAL=false when building from source to get a build that supports compression. You can also install the R package from a git checkout:

    git clone https://github.com/apache

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

Submitted by 两盒软妹~` on 2021-01-28 10:34:43
Question: I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same, but reads the same mapped region once process A has finished writing. Using mmap, process B is supposed to read the file from memory rather than from disk, assuming process A has not called munmap. If I deploy process A and process B to different containers in the same pod in Kubernetes, is memory-mapped IO supposed to work the same way as in the initial example? Should container B
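
To make the setup concrete, here is a minimal single-host sketch in Python; the path is hypothetical, and in Kubernetes the two containers would need a shared volume mount (for example an emptyDir in the same pod) for both to see the same file.

```python
# Minimal sketch of the scenario: both mappings are backed by the same file,
# so on one host the reader sees the writer's bytes through the shared page
# cache. PATH is a hypothetical shared-volume location.
import mmap
import os

PATH = "/shared/data.bin"
SIZE = 4096

# --- process A: write through a shared mapping ---
fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o644)
os.ftruncate(fd, SIZE)
with mmap.mmap(fd, SIZE, prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
    m[:5] = b"hello"
os.close(fd)

# --- process B: read the same region ---
fd = os.open(PATH, os.O_RDONLY)
with mmap.mmap(fd, SIZE, prot=mmap.PROT_READ) as m:
    assert m[:5] == b"hello"
os.close(fd)
```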

Convert Pandas DataFrame to & from In-Memory Feather

Submitted by 家住魔仙堡 on 2020-01-11 09:59:12
Question: Using the IO tools in pandas, it is possible to convert a DataFrame to an in-memory Feather buffer:

    import pandas as pd
    from io import BytesIO

    df = pd.DataFrame({'a': [1,2], 'b': [3.0,4.0]})
    buf = BytesIO()
    df.to_feather(buf)

However, using the same buffer to convert back to a DataFrame,

    pd.read_feather(buf)

results in an error:

    ArrowInvalid: Not a feather file

How can a DataFrame be converted to an in-memory Feather representation and, correspondingly, back to a DataFrame? Thank you in advance.
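
One commonly suggested fix is to write the Feather data with pyarrow's feather module and rewind the buffer before reading it back; after writing, the BytesIO cursor sits at end-of-stream. A sketch, assuming a reasonably recent pyarrow:

```python
# A sketch of one commonly suggested round-trip: write Feather with pyarrow
# and rewind the buffer before reading it back.
import pandas as pd
import pyarrow.feather as feather
from io import BytesIO

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

buf = BytesIO()
feather.write_feather(df, buf)

buf.seek(0)  # rewind: the cursor is at end-of-stream after writing
df2 = feather.read_feather(buf)

assert df.equals(df2)
```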

Apache arrow, alignment and padding

Submitted by 不羁岁月 on 2020-01-06 03:15:06
Question: I want to use Apache Arrow because it enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing (https://arrow.apache.org/). From the documentation (https://arrow.apache.org/docs/memory_layout.html), I understand that memory allocations are guaranteed to be 64-byte aligned. In order to verify this 64-byte alignment, I use the __array_interface__ data member
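
As a concrete version of that check, the sketch below inspects the address of an Arrow array's data buffer both directly (pyarrow's Buffer.address) and through numpy's __array_interface__, as described above; a 64-byte-aligned allocation gives 0 in both cases. This is illustrative only, not a guarantee about every build or allocator.

```python
# A small sketch of the alignment check described above (illustrative only).
import pyarrow as pa

arr = pa.array(range(1024), type=pa.int64())

# buffers()[1] is the data buffer (buffers()[0] is the validity bitmap).
data_buf = arr.buffers()[1]
print(data_buf.address % 64)  # 0 if the allocation is 64-byte aligned

# Same check via numpy's __array_interface__, as in the question
# (zero-copy for a null-free primitive array).
np_view = arr.to_numpy()
print(np_view.__array_interface__['data'][0] % 64)
```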