pyarrow

Excessive memory usage when reading Parquet in Python

Submitted by 与世无争的帅哥 on 2020-08-08 04:54:10
Question: I have a Parquet file of around 10+ GB, with columns that are mainly strings. When loading it into memory, usage can peak at about 110 GB, and after loading finishes it drops back to around 40 GB. I'm working on a high-performance computer with allocated memory, so I do have access to large amounts of it. Still, it seems wasteful to have to request 128 GB just to load the data when 64 GB is sufficient afterwards. Also, 128 GB of memory is more often to be out of order
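
One way to lower that peak, sketched here on the assumption of a reasonably recent pyarrow (0.17+) and with 'big_strings.parquet' standing in for the real path, is to let to_pandas hand the Arrow buffers over to pandas instead of keeping both copies alive:

    import pyarrow.parquet as pq

    # Read the file as an Arrow Table first ...
    table = pq.read_table('big_strings.parquet')  # placeholder path

    # ... then convert with the memory-friendly options: split_blocks avoids
    # consolidating everything into one huge pandas block, and self_destruct
    # releases each Arrow column as soon as it has been converted, so the
    # Arrow and pandas copies of the string data do not coexist at the peak.
    df = table.to_pandas(split_blocks=True, self_destruct=True)
    del table  # the Table must not be used again after self_destruct

Reading row group by row group via pq.ParquetFile and processing incrementally is the other common route when even one full pandas copy is too much.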

Reading csv file from hdfs using dask and pyarrow

Submitted by 狂风中的少年 on 2020-07-23 10:56:07
Question: We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts between boost-cpp and the pyarrow version 0.10.0 I'm running. We are trying to read a CSV file from HDFS, but we get an error when running dd.read_csv('hdfs:///path/to/file.csv') because it tries to use hdfs3: ImportError: Can not find the shared library: libhdfs3.so. From the documentation it seems there is an option to use pyarrow. What is the correct syntax/configuration to do so? Answer 1: Try
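
A rough sketch of the pyarrow route, with the caveat that the 'hdfs_driver' configuration key is an assumption about dask releases of that era and should be checked against the installed version; pa.hdfs.connect() is the legacy pyarrow HDFS client and is shown only as a sanity check that pyarrow itself can reach the cluster:

    import dask
    import dask.dataframe as dd
    import pandas as pd
    import pyarrow as pa

    # Sanity check: can pyarrow's own HDFS bindings (libhdfs, not libhdfs3)
    # open the file at all? pa.hdfs.connect() is the legacy API in pyarrow 0.10.
    fs = pa.hdfs.connect()
    with fs.open('/path/to/file.csv', 'rb') as f:
        print(pd.read_csv(f, nrows=5))

    # Ask dask to use the pyarrow driver instead of hdfs3. The 'hdfs_driver'
    # key is an assumption about dask of this vintage; newer dask/fsspec picks
    # pyarrow automatically when hdfs3 is not installed.
    dask.config.set({'hdfs_driver': 'pyarrow'})
    df = dd.read_csv('hdfs:///path/to/file.csv')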

Pyarrow read/write from s3

Submitted by ε祈祈猫儿з on 2020-07-19 03:32:16
Question: Is it possible to read and write Parquet files from one folder to another in S3 using pyarrow, without converting to pandas? Here is my code:

    import pyarrow.parquet as pq
    import pyarrow as pa
    import s3fs

    s3 = s3fs.S3FileSystem()
    bucket = 'demo-s3'
    pd = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(nthreads=4).to_pandas()
    table = pa.Table.from_pandas(pd)
    pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression=
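
A minimal sketch of the pandas-free version of the same copy, keeping the bucket layout from the question and picking snappy only as an illustrative compression choice:

    import pyarrow.parquet as pq
    import s3fs

    s3 = s3fs.S3FileSystem()
    bucket = 'demo-s3'

    # read() already returns an Arrow Table, so the to_pandas()/from_pandas()
    # round trip can simply be dropped.
    table = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read()

    pq.write_to_dataset(
        table,
        's3://{0}/new'.format(bucket),
        filesystem=s3,
        use_dictionary=True,
        compression='snappy',
    )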

Google BigQuery Schema conflict (pyarrow error) with Numeric data type using load_table_from_dataframe

Submitted by 江枫思渺然 on 2020-07-10 08:44:06
Question: I get the following error when I upload numeric data (int64 or float64) from a pandas DataFrame to a NUMERIC Google BigQuery column: pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16). I tried changing the dtype of the 'tt' field of the pandas DataFrame, without results: df_data_f['tt'] = df_data_f['tt'].astype('float64') and df_data_f['tt'] = df_data_f['tt'].astype('int64'), using the schema: job_config.schema = [ ... bigquery.SchemaField('tt', 'NUMERIC') ...]. Reading this
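
The length mismatch in the error suggests pyarrow is handing BigQuery 8-byte ints/floats where NUMERIC expects a 16-byte decimal128. A commonly reported fix, sketched here reusing df_data_f from the question and with 'my_dataset.my_table' as a placeholder destination, is to convert the column to Python decimal.Decimal before loading:

    import decimal

    from google.cloud import bigquery

    # NUMERIC maps to a 16-byte decimal128 on the Arrow side, so the column
    # must contain decimal.Decimal values rather than raw int64/float64.
    df_data_f['tt'] = df_data_f['tt'].apply(lambda v: decimal.Decimal(str(v)))

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.schema = [bigquery.SchemaField('tt', 'NUMERIC')]

    client.load_table_from_dataframe(
        df_data_f, 'my_dataset.my_table', job_config=job_config  # placeholder table id
    ).result()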

Python pandas_udf spark error

Submitted by こ雲淡風輕ζ on 2020-07-05 10:36:08
Question: I started playing around with Spark locally and found this weird issue.

1) pip install pyspark==2.3.1
2) In the pyspark shell:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
    sp_df = spark.createDataFrame(df)

    @pandas_udf('long', PandasUDFType.SCALAR)
    def pandas_plus_one(v):
        return v + 1

    sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()

Taking this example from here https://databricks.com/blog/2017/10/30
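
Since the actual traceback is cut off above, the following is only a guess at the usual culprit: pandas_udf on Spark 2.3.x tends to break with pyarrow 0.15 or newer because of the Arrow IPC format change, so pinning an older pyarrow is worth trying first. A self-contained version of the snippet that runs outside the pyspark shell (where spark is not predefined):

    # Assumes: pip install pyspark==2.3.1 pyarrow==0.14.1
    # (pyarrow < 0.15 avoids the IPC format change that Spark 2.3.x predates)
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.master('local[*]').appName('pandas-udf-demo').getOrCreate()

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
    sp_df = spark.createDataFrame(df)

    @pandas_udf('long', PandasUDFType.SCALAR)
    def pandas_plus_one(v):
        return v + 1

    sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()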

PySpark 2.4.5: IllegalArgumentException when using PandasUDF

Submitted by 送分小仙女□ on 2020-06-12 08:01:13
Question: I am trying Pandas UDFs and hitting IllegalArgumentException. I also tried replicating the examples from the PySpark GroupedData documentation to check, but I still get the error. The environment configuration: Python 3.7, PySpark==2.4.5 installed via pip, PyArrow==0.16.0 installed via pip.

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf('int', PandasUDFType.GROUPED_AGG)
    def min_udf(v):
        return v.min()

    sorted(gdf.agg(min_udf(df.age)).collect())

Output
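
Spark 2.4.x was built against Arrow before the 0.15 IPC format change, so PyArrow 0.16.0 is a likely source of this exception. The compatibility setting described in the Spark 2.4 documentation is sketched below (downgrading to pyarrow 0.14.1 is the other common fix), with df and gdf filled in so the snippet runs end to end:

    import os

    # Compatibility setting for PyArrow >= 0.15.0 with Spark 2.3.x / 2.4.x:
    # re-enable the pre-0.15 Arrow IPC format on the driver ...
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (
        SparkSession.builder
        .master('local[*]')
        # ... and on the executors as well.
        .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', '1')
        .getOrCreate()
    )

    df = spark.createDataFrame([('a', 1), ('a', 3), ('b', 2)], ['name', 'age'])
    gdf = df.groupby('name')

    @pandas_udf('int', PandasUDFType.GROUPED_AGG)
    def min_udf(v):
        return v.min()

    print(sorted(gdf.agg(min_udf(df.age)).collect()))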

Pandas to Parquet: get the content of the resulting file in a variable instead of writing to the file system

Submitted by 泄露秘密 on 2020-05-15 09:04:52
Question: There are several ways to convert from pandas to Parquet, e.g. pyarrow.Table.from_pandas or DataFrame.to_parquet. What they have in common is that they take as a parameter a file path where the Parquet output should be stored. I need to get the content of the written Parquet file into a variable and have not seen a way to do this yet. Essentially I want the same behavior as pandas.to_csv, which returns the result as a string if no path is provided. Of course I could just write the file and read
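
Both libraries accept in-memory targets, so no temporary file is needed; a short sketch under the assumption of a reasonably recent pandas/pyarrow, using a throwaway DataFrame:

    import io

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': [1, 2, 3]})

    # Option 1: pandas will write to any binary file-like object.
    buf = io.BytesIO()
    df.to_parquet(buf, engine='pyarrow')
    parquet_bytes = buf.getvalue()

    # Option 2: stay in pyarrow and use an in-memory output stream.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    parquet_bytes_too = sink.getvalue().to_pybytes()

Either bytes object can then be passed around or uploaded, much like the string pandas.to_csv returns when no path is given.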

Repartitioning parquet-mr generated parquets with pyarrow/parquet-cpp increases file size by 30x?

Submitted by 北城以北 on 2020-04-30 16:38:22
Question: Using AWS Firehose I am converting incoming records to Parquet. In one example, 150k identical records enter Firehose and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by an S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning, the 30 KB file jumps to 900 KB. Inspecting both Parquet files: the metadata doesn't change; the data
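
With 150k identical records, most of the original 30 KB presumably comes from dictionary encoding plus compression, so the write settings in the repartitioning Lambda are the first thing to compare. A sketch of an explicit rewrite, with placeholder paths and snappy chosen only as an example codec:

    import pyarrow.parquet as pq

    # Placeholder path for the Firehose-produced input.
    table = pq.read_table('firehose_part.parquet')

    # Make the encoding choices explicit when rewriting: with highly repetitive
    # data, losing dictionary encoding or compression on the way back out can
    # inflate the file dramatically. Row-group size is worth matching too.
    pq.write_table(
        table,
        'repartitioned_part.parquet',  # placeholder output path
        use_dictionary=True,
        compression='snappy',
        row_group_size=150000,
    )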