pyarrow

Excessive memory usage when reading Parquet in Python

Submitted by 与世无争的帅哥 on 2020-08-08 04:54:10
Question: I have a Parquet file of around 10+ GB, with columns that are mainly strings. When loading it into memory, usage can peak at about 110 GB, and after loading finishes it drops back to around 40 GB. I'm working on a high-performance computer with allocated memory, so I do have access to large amounts of it. Still, it seems wasteful to have to request 128 GB just to load the data when 64 GB is sufficient afterwards. Also, 128 GB of memory is more often to be out of order
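
One way to lower that peak, sketched here on the assumption of a reasonably recent pyarrow (0.17+) and with 'big_strings.parquet' standing in for the real path, is to let to_pandas hand the Arrow buffers over to pandas instead of keeping both copies alive:

    import pyarrow.parquet as pq

    # Read the file as an Arrow Table first ...
    table = pq.read_table('big_strings.parquet')  # placeholder path

    # ... then convert with the memory-friendly options: split_blocks avoids
    # consolidating everything into one huge pandas block, and self_destruct
    # releases each Arrow column as soon as it has been converted, so the
    # Arrow and pandas copies of the string data do not coexist at the peak.
    df = table.to_pandas(split_blocks=True, self_destruct=True)
    del table  # the Table must not be used again after self_destruct

Reading row group by row group via pq.ParquetFile and processing incrementally is the other common route when even one full pandas copy is too much.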

Reading csv file from hdfs using dask and pyarrow

Submitted by 狂风中的少年 on 2020-07-23 10:56:07
Question: We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts between boost-cpp and the pyarrow version 0.10.0 I'm running. We are trying to read a CSV file from HDFS, but we get an error when running dd.read_csv('hdfs:///path/to/file.csv') because it tries to use hdfs3: ImportError: Can not find the shared library: libhdfs3.so. From the documentation it seems there is an option to use pyarrow. What is the correct syntax/configuration to do so? Answer 1: Try
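
A rough sketch of the pyarrow route, with the caveat that the 'hdfs_driver' configuration key is an assumption about dask releases of that era and should be checked against the installed version; pa.hdfs.connect() is the legacy pyarrow HDFS client and is shown only as a sanity check that pyarrow itself can reach the cluster:

    import dask
    import dask.dataframe as dd
    import pandas as pd
    import pyarrow as pa

    # Sanity check: can pyarrow's own HDFS bindings (libhdfs, not libhdfs3)
    # open the file at all? pa.hdfs.connect() is the legacy API in pyarrow 0.10.
    fs = pa.hdfs.connect()
    with fs.open('/path/to/file.csv', 'rb') as f:
        print(pd.read_csv(f, nrows=5))

    # Ask dask to use the pyarrow driver instead of hdfs3. The 'hdfs_driver'
    # key is an assumption about dask of this vintage; newer dask/fsspec picks
    # pyarrow automatically when hdfs3 is not installed.
    dask.config.set({'hdfs_driver': 'pyarrow'})
    df = dd.read_csv('hdfs:///path/to/file.csv')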

Pyarrow read/write from s3

Submitted by ε祈祈猫儿з on 2020-07-19 03:32:16
Question: Is it possible to read and write Parquet files from one folder to another in S3 using pyarrow, without converting to pandas? Here is my code:

    import pyarrow.parquet as pq
    import pyarrow as pa
    import s3fs

    s3 = s3fs.S3FileSystem()
    bucket = 'demo-s3'
    pd = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(nthreads=4).to_pandas()
    table = pa.Table.from_pandas(pd)
    pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression=
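
A minimal sketch of the pandas-free version of the same copy, keeping the bucket layout from the question and picking snappy only as an illustrative compression choice:

    import pyarrow.parquet as pq
    import s3fs

    s3 = s3fs.S3FileSystem()
    bucket = 'demo-s3'

    # read() already returns an Arrow Table, so the to_pandas()/from_pandas()
    # round trip can simply be dropped.
    table = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read()

    pq.write_to_dataset(
        table,
        's3://{0}/new'.format(bucket),
        filesystem=s3,
        use_dictionary=True,
        compression='snappy',
    )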

Google BigQuery Schema conflict (pyarrow error) with Numeric data type using load_table_from_dataframe

Submitted by 江枫思渺然 on 2020-07-10 08:44:06
Question: I get the following error when I upload numeric data (int64 or float64) from a pandas DataFrame to a NUMERIC Google BigQuery column: pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16). I tried changing the dtype of the 'tt' field of the pandas DataFrame, without results: df_data_f['tt'] = df_data_f['tt'].astype('float64') and df_data_f['tt'] = df_data_f['tt'].astype('int64'), using the schema: job_config.schema = [ ... bigquery.SchemaField('tt', 'NUMERIC') ...]. Reading this
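
The length mismatch in the error suggests pyarrow is handing BigQuery 8-byte ints/floats where NUMERIC expects a 16-byte decimal128. A commonly reported fix, sketched here reusing df_data_f from the question and with 'my_dataset.my_table' as a placeholder destination, is to convert the column to Python decimal.Decimal before loading:

    import decimal

    from google.cloud import bigquery

    # NUMERIC maps to a 16-byte decimal128 on the Arrow side, so the column
    # must contain decimal.Decimal values rather than raw int64/float64.
    df_data_f['tt'] = df_data_f['tt'].apply(lambda v: decimal.Decimal(str(v)))

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.schema = [bigquery.SchemaField('tt', 'NUMERIC')]

    client.load_table_from_dataframe(
        df_data_f, 'my_dataset.my_table', job_config=job_config  # placeholder table id
    ).result()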

Python pandas_udf spark error

Submitted by こ雲淡風輕ζ on 2020-07-05 10:36:08
Question: I started playing around with Spark locally and found this weird issue.

1) pip install pyspark==2.3.1
2) In the pyspark shell:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
    sp_df = spark.createDataFrame(df)

    @pandas_udf('long', PandasUDFType.SCALAR)
    def pandas_plus_one(v):
        return v + 1

    sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()

Taking this example from here https://databricks.com/blog/2017/10/30
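
Since the actual traceback is cut off above, the following is only a guess at the usual culprit: pandas_udf on Spark 2.3.x tends to break with pyarrow 0.15 or newer because of the Arrow IPC format change, so pinning an older pyarrow is worth trying first. A self-contained version of the snippet that runs outside the pyspark shell (where spark is not predefined):

    # Assumes: pip install pyspark==2.3.1 pyarrow==0.14.1
    # (pyarrow < 0.15 avoids the IPC format change that Spark 2.3.x predates)
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.master('local[*]').appName('pandas-udf-demo').getOrCreate()

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
    sp_df = spark.createDataFrame(df)

    @pandas_udf('long', PandasUDFType.SCALAR)
    def pandas_plus_one(v):
        return v + 1

    sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()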

PySpark 2.4.5: IllegalArgumentException when using PandasUDF

Submitted by 送分小仙女□ on 2020-06-12 08:01:13
Question: I am trying Pandas UDFs and hitting IllegalArgumentException. I also tried replicating the examples from the PySpark GroupedData documentation to check, but I still get the error. The environment configuration: Python 3.7, PySpark==2.4.5 installed via pip, PyArrow==0.16.0 installed via pip.

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf('int', PandasUDFType.GROUPED_AGG)
    def min_udf(v):
        return v.min()

    sorted(gdf.agg(min_udf(df.age)).collect())

Output
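
Spark 2.4.x was built against Arrow before the 0.15 IPC format change, so PyArrow 0.16.0 is a likely source of this exception. The compatibility setting described in the Spark 2.4 documentation is sketched below (downgrading to pyarrow 0.14.1 is the other common fix), with df and gdf filled in so the snippet runs end to end:

    import os

    # Compatibility setting for PyArrow >= 0.15.0 with Spark 2.3.x / 2.4.x:
    # re-enable the pre-0.15 Arrow IPC format on the driver ...
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (
        SparkSession.builder
        .master('local[*]')
        # ... and on the executors as well.
        .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', '1')
        .getOrCreate()
    )

    df = spark.createDataFrame([('a', 1), ('a', 3), ('b', 2)], ['name', 'age'])
    gdf = df.groupby('name')

    @pandas_udf('int', PandasUDFType.GROUPED_AGG)
    def min_udf(v):
        return v.min()

    print(sorted(gdf.agg(min_udf(df.age)).collect()))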

Pandas to Parquet: get the content of the resulting file in a variable instead of writing to the file system

Submitted by 泄露秘密 on 2020-05-15 09:04:52
Question: There are several ways to convert from pandas to Parquet, e.g. pyarrow.Table.from_pandas or DataFrame.to_parquet. What they have in common is that they take as a parameter a file path where the Parquet output should be stored. I need to get the content of the written Parquet file into a variable and have not seen a way to do this yet. Essentially I want the same behavior as pandas.to_csv, which returns the result as a string if no path is provided. Of course I could just write the file and read
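
Both libraries accept in-memory targets, so no temporary file is needed; a short sketch under the assumption of a reasonably recent pandas/pyarrow, using a throwaway DataFrame:

    import io

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': [1, 2, 3]})

    # Option 1: pandas will write to any binary file-like object.
    buf = io.BytesIO()
    df.to_parquet(buf, engine='pyarrow')
    parquet_bytes = buf.getvalue()

    # Option 2: stay in pyarrow and use an in-memory output stream.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    parquet_bytes_too = sink.getvalue().to_pybytes()

Either bytes object can then be passed around or uploaded, much like the string pandas.to_csv returns when no path is given.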

Repartitioning parquet-mr generated parquets with pyarrow/parquet-cpp increases file size by 30x?

Submitted by 北城以北 on 2020-04-30 16:38:22
Question: Using AWS Firehose I am converting incoming records to Parquet. In one example, 150k identical records enter Firehose and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by an S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning, the 30 KB file jumps to 900 KB. Inspecting both Parquet files: the metadata doesn't change; the data
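
With 150k identical records, most of the original 30 KB presumably comes from dictionary encoding plus compression, so the write settings in the repartitioning Lambda are the first thing to compare. A sketch of an explicit rewrite, with placeholder paths and snappy chosen only as an example codec:

    import pyarrow.parquet as pq

    # Placeholder path for the Firehose-produced input.
    table = pq.read_table('firehose_part.parquet')

    # Make the encoding choices explicit when rewriting: with highly repetitive
    # data, losing dictionary encoding or compression on the way back out can
    # inflate the file dramatically. Row-group size is worth matching too.
    pq.write_table(
        table,
        'repartitioned_part.parquet',  # placeholder output path
        use_dictionary=True,
        compression='snappy',
        row_group_size=150000,
    )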