pyspark-sql

Memory leaks when using pandas_udf and Parquet serialization?

╄→гoц情女王★ submitted on 2021-02-06 10:15:05
Question: I am currently developing my first complete system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a split-apply-combine strategy in order to modify a DataFrame. That is, I would like to apply a function to each of the groups defined by a given column and finally combine them all. The problem is that the function I want to apply is the prediction method of a fitted model that "speaks" the pandas idiom, i.e., it is vectorized and …
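A minimal sketch of the split-apply-combine step described above, using applyInPandas (Spark 3.0+). The toy dataframe, the column names and the stand-in prediction function are illustrative, not taken from the question; a real fitted model would replace the hard-coded arithmetic.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "feature"])

def predict_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for model.predict(pdf); loading or broadcasting the fitted model
    # inside this function avoids re-serializing it with every task.
    pdf["prediction"] = pdf["feature"] * 2.0
    return pdf

result = df.groupBy("group").applyInPandas(
    predict_group,
    schema="group string, feature double, prediction double",
)
result.show()

Returning only the columns declared in the schema, and keeping each group's pandas frame small, is one common way to limit executor memory pressure with grouped pandas UDFs.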

Flatten Nested Struct in PySpark Array

跟風遠走 submitted on 2021-02-04 16:37:26
Question: Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisor.*, and then …
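A sketch of one alternative that avoids the explode/re-group round trip, assuming a dataframe df with the first schema shown above: rebuild each array element with the transform higher-order function (available through expr in Spark 2.4+).

from pyspark.sql import functions as F

# Replace each struct in `degrees` with a flattened struct that pulls the
# advisor fields up one level.
flattened = df.withColumn(
    "degrees",
    F.expr(
        "transform(degrees, d -> named_struct("
        "'school', d.school, "
        "'advisor1', d.advisors.advisor1, "
        "'advisor2', d.advisors.advisor2))"
    ),
)
flattened.printSchema()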

How to match/extract multi-line pattern from file in pyspark

帅比萌擦擦* submitted on 2021-02-04 15:51:50
Question: I have a huge file of RDF triples (subject predicate object) as shown in the image below. The goal is to extract the bold items and produce the following output:

Item_Id | quantityAmount | quantityUnit | rank
----------------------------------------------
Q31     | 24954          | Meter        | BestRank
Q25     | 582            | Kilometer    | NormalRank

I want to extract lines that follow this pattern:

the subject is given a pointer ( <Q31> <prop/P1082> <Pointer_Q31-87RF> . )
the pointer has a ranking ( <Pointer_Q31-87RF> <rank> <BestRank> …
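A rough sketch of one way to correlate such triples in PySpark: read the file as lines, split each line into (subject, predicate, object), then join the item-to-pointer triples with the pointer-to-rank triples. The file path and the substrings used to filter predicates are assumptions for illustration, not taken from the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

triples = (
    spark.read.text("triples.nt")  # hypothetical path
    .select(F.split(F.col("value"), r"\s+").alias("t"))
    .select(
        F.col("t")[0].alias("subject"),
        F.col("t")[1].alias("predicate"),
        F.col("t")[2].alias("object"),
    )
)

# Triples that attach a pointer to an item, e.g. <Q31> <prop/P1082> <Pointer_Q31-87RF> .
pointers = (
    triples.filter(F.col("predicate").contains("P1082"))
    .select(F.col("subject").alias("item"), F.col("object").alias("pointer"))
)

# Triples that attach a rank to a pointer, e.g. <Pointer_Q31-87RF> <rank> <BestRank> .
ranks = (
    triples.filter(F.col("predicate").contains("rank"))
    .select(F.col("subject").alias("pointer"), F.col("object").alias("rank"))
)

result = pointers.join(ranks, "pointer").select("item", "rank")
result.show(truncate=False)

The quantityAmount and quantityUnit columns would be pulled in the same way, as further joins on the pointer id.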

How to speed up spark df.write jdbc to postgres database?

最后都变了- submitted on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as …
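A sketch of commonly suggested tweaks for this kind of write, reusing the names from the snippet above (psql_url_spark, spark_env, schema, table, mode): write from several partitions so inserts run in parallel, use a moderate batchsize instead of a huge one, and enable the Postgres JDBC driver's reWriteBatchedInserts option. The partition count and batch size below are illustrative and need tuning against the actual cluster and database.

(
    df.repartition(8)  # illustrative; match to executor cores and what the DB can absorb
    .write.format("jdbc")
    .options(
        url=psql_url_spark + "?reWriteBatchedInserts=true",  # assumes the URL has no query string yet
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=10000,
        numPartitions=8,
    )
    .mode(mode)
    .save()
)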

PySpark and time series data: how to smartly avoid overlapping dates?

走远了吗. submitted on 2021-01-29 18:40:26
Question: I have the following sample Spark dataframe:

import datetime as dt
import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18, …
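A sketch of the usual gaps-and-islands approach to merging overlapping date ranges per id with window functions. The column names (id, start, end) and the toy rows are assumptions, since the question's dataframe is truncated above: rows are ordered by start, a row starts a new interval whenever its start is after every earlier end, and a running sum of that flag groups the overlapping rows.

import datetime as dt
from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40)),
        (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
        (484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
    ],
    ["id", "start", "end"],
)

w = Window.partitionBy("id").orderBy("start")

merged = (
    # Largest end time seen in any earlier row of the same id.
    df.withColumn(
        "prev_max_end",
        fn.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)),
    )
    # Flag rows that start a brand-new, non-overlapping interval.
    .withColumn(
        "new_interval",
        fn.when(fn.col("start") <= fn.col("prev_max_end"), fn.lit(0)).otherwise(fn.lit(1)),
    )
    # Running sum of the flag gives one group id per merged interval.
    .withColumn("interval_id", fn.sum("new_interval").over(w))
    .groupBy("id", "interval_id")
    .agg(fn.min("start").alias("start"), fn.max("end").alias("end"))
)
merged.show(truncate=False)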

How to create BinaryType Column using multiple columns of a pySpark Dataframe?

血红的双手。 submitted on 2021-01-29 17:53:58
Question: I have recently started working with PySpark, so I don't know many details regarding this. I am trying to create a BinaryType column in a data frame, but I am struggling to do it. For example, let's take a simple df:

df.show(2)

+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+

Now I want a third column "col3" with BinaryType, like:

+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+

How should I do it?

Answer 1: …
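One possible sketch (not necessarily what the truncated answer went on to suggest): concatenate the two columns into a single string, substituting the word "null" for missing values, and cast the result to binary, which stores its UTF-8 bytes.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", None), ("2", "20")], ["col1", "col2"])

df3 = df.withColumn(
    "col3",
    F.concat_ws(
        " ",
        F.coalesce(F.col("col1"), F.lit("null")),
        F.coalesce(F.col("col2"), F.lit("null")),
    ).cast("binary"),
)
df3.printSchema()  # col3 appears as binary
df3.show(truncate=False)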

Read CSV file in pyspark with ANSI encoding

荒凉一梦 submitted on 2021-01-29 13:25:54
Question: I am trying to read in a csv/text file that needs to be read using ANSI encoding, but it is not working. Any ideas?

mainDF = spark.read.format("csv")\
    .option("encoding", "ANSI")\
    .option("header", "true")\
    .option("maxRowsInMemory", 1000)\
    .option("inferSchema", "false")\
    .option("delimiter", "¬")\
    .load(path)

java.nio.charset.UnsupportedCharsetException: ANSI

The file is over 5GB, hence the Spark requirement. I have also tried ANSI in lower case.

Answer 1: ISO-8859-1 is the same as ANSI …
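Following the answer above, a sketch that simply swaps the charset name: "ANSI" is not a charset name Java recognises, so an equivalent such as ISO-8859-1 (or windows-1252, which is usually what "ANSI" means on Windows) is passed instead. The spark session and path are reused from the question.

mainDF = (
    spark.read.format("csv")
    .option("encoding", "ISO-8859-1")   # or "windows-1252" for Windows "ANSI" files
    .option("header", "true")
    .option("inferSchema", "false")
    .option("delimiter", "¬")
    .load(path)
)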

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

  NA_counts  min  max
A 5          0    100
B 10         0    120
C 8          1    99
D 2          0    500

TIA

Answer 1: You can calculate each metric separately and then union them all, like this:

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]
nulls_df = df …
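A hedged completion of the approach the answer starts (the rest of it is cut off above, so the union step below is an assumption): compute each metric as a one-row aggregate tagged with a metric name, then union the three frames. The toy dataframe is illustrative; in the answer, df and cols refer to the question's dataframe and column list.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, max, min, sum, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 0, 5, 10), (None, 2, 6, 20)], ["A", "B", "C", "D"])
cols = ["A", "B", "C", "D"]

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]

nulls_df = df.agg(*nulls_cols).withColumn("metric", lit("NA_counts"))
min_df = df.agg(*min_cols).withColumn("metric", lit("min"))
max_df = df.agg(*max_cols).withColumn("metric", lit("max"))

# One row per metric, one column per original column; transpose on the driver
# (e.g. via toPandas) if the per-column layout shown in the question is wanted.
summary = nulls_df.unionByName(min_df).unionByName(max_df)
summary.show()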

Rename Column in Athena

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:27:05
Question: The Athena table "organization" reads data from parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find data for the new column in the parquet files. Please let me know if there are ways to resolve this.

Answer 1: You have to change the schema and point to the new column "fee", but it depends on your situation. If you have two data sets, where in one dataset it is called "cost" and in another dataset it is …