pyspark-sql

Memory leaks when using pandas_udf and Parquet serialization?

╄→гoц情女王★ submitted on 2021-02-06 10:15:05
Question: I am currently developing my first complete system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a split-apply-combine strategy in order to modify a DataFrame. That is, I would like to apply a function to each of the groups defined by a given column and finally combine them all. The problem is that the function I want to apply is the prediction method of a fitted model that "speaks" the pandas idiom, i.e., it is vectorized and …
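A minimal sketch of the split-apply-combine step described above, using applyInPandas (Spark 3.0+). The toy dataframe, the column names and the stand-in prediction function are illustrative, not taken from the question; a real fitted model would replace the hard-coded arithmetic.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "feature"])

def predict_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for model.predict(pdf); loading or broadcasting the fitted model
    # inside this function avoids re-serializing it with every task.
    pdf["prediction"] = pdf["feature"] * 2.0
    return pdf

result = df.groupBy("group").applyInPandas(
    predict_group,
    schema="group string, feature double, prediction double",
)
result.show()

Returning only the columns declared in the schema, and keeping each group's pandas frame small, is one common way to limit executor memory pressure with grouped pandas UDFs.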

Flatten Nested Struct in PySpark Array

跟風遠走 submitted on 2021-02-04 16:37:26
Question: Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisor.*, and then …
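A sketch of one alternative that avoids the explode/re-group round trip, assuming a dataframe df with the first schema shown above: rebuild each array element with the transform higher-order function (available through expr in Spark 2.4+).

from pyspark.sql import functions as F

# Replace each struct in `degrees` with a flattened struct that pulls the
# advisor fields up one level.
flattened = df.withColumn(
    "degrees",
    F.expr(
        "transform(degrees, d -> named_struct("
        "'school', d.school, "
        "'advisor1', d.advisors.advisor1, "
        "'advisor2', d.advisors.advisor2))"
    ),
)
flattened.printSchema()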

How to match/extract multi-line pattern from file in pyspark

帅比萌擦擦* submitted on 2021-02-04 15:51:50
Question: I have a huge file of RDF triples (subject predicate object) as shown in the image below. The goal is to extract the bold items and produce the following output:

Item_Id | quantityAmount | quantityUnit | rank
----------------------------------------------
Q31     | 24954          | Meter        | BestRank
Q25     | 582            | Kilometer    | NormalRank

I want to extract lines that follow this pattern:

the subject is given a pointer ( <Q31> <prop/P1082> <Pointer_Q31-87RF> . )
the pointer has a ranking ( <Pointer_Q31-87RF> <rank> <BestRank> …
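A rough sketch of one way to correlate such triples in PySpark: read the file as lines, split each line into (subject, predicate, object), then join the item-to-pointer triples with the pointer-to-rank triples. The file path and the substrings used to filter predicates are assumptions for illustration, not taken from the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

triples = (
    spark.read.text("triples.nt")  # hypothetical path
    .select(F.split(F.col("value"), r"\s+").alias("t"))
    .select(
        F.col("t")[0].alias("subject"),
        F.col("t")[1].alias("predicate"),
        F.col("t")[2].alias("object"),
    )
)

# Triples that attach a pointer to an item, e.g. <Q31> <prop/P1082> <Pointer_Q31-87RF> .
pointers = (
    triples.filter(F.col("predicate").contains("P1082"))
    .select(F.col("subject").alias("item"), F.col("object").alias("pointer"))
)

# Triples that attach a rank to a pointer, e.g. <Pointer_Q31-87RF> <rank> <BestRank> .
ranks = (
    triples.filter(F.col("predicate").contains("rank"))
    .select(F.col("subject").alias("pointer"), F.col("object").alias("rank"))
)

result = pointers.join(ranks, "pointer").select("item", "rank")
result.show(truncate=False)

The quantityAmount and quantityUnit columns would be pulled in the same way, as further joins on the pointer id.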

How to speed up spark df.write jdbc to postgres database?

最后都变了- submitted on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as …
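A sketch of commonly suggested tweaks for this kind of write, reusing the names from the snippet above (psql_url_spark, spark_env, schema, table, mode): write from several partitions so inserts run in parallel, use a moderate batchsize instead of a huge one, and enable the Postgres JDBC driver's reWriteBatchedInserts option. The partition count and batch size below are illustrative and need tuning against the actual cluster and database.

(
    df.repartition(8)  # illustrative; match to executor cores and what the DB can absorb
    .write.format("jdbc")
    .options(
        url=psql_url_spark + "?reWriteBatchedInserts=true",  # assumes the URL has no query string yet
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=10000,
        numPartitions=8,
    )
    .mode(mode)
    .save()
)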

PySpark and time series data: how to smartly avoid overlapping dates?

走远了吗. submitted on 2021-01-29 18:40:26
Question: I have the following sample Spark dataframe:

import datetime as dt
import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18, …
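A sketch of the usual gaps-and-islands approach to merging overlapping date ranges per id with window functions. The column names (id, start, end) and the toy rows are assumptions, since the question's dataframe is truncated above: rows are ordered by start, a row starts a new interval whenever its start is after every earlier end, and a running sum of that flag groups the overlapping rows.

import datetime as dt
from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40)),
        (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
        (484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
    ],
    ["id", "start", "end"],
)

w = Window.partitionBy("id").orderBy("start")

merged = (
    # Largest end time seen in any earlier row of the same id.
    df.withColumn(
        "prev_max_end",
        fn.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)),
    )
    # Flag rows that start a brand-new, non-overlapping interval.
    .withColumn(
        "new_interval",
        fn.when(fn.col("start") <= fn.col("prev_max_end"), fn.lit(0)).otherwise(fn.lit(1)),
    )
    # Running sum of the flag gives one group id per merged interval.
    .withColumn("interval_id", fn.sum("new_interval").over(w))
    .groupBy("id", "interval_id")
    .agg(fn.min("start").alias("start"), fn.max("end").alias("end"))
)
merged.show(truncate=False)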

How to create BinaryType Column using multiple columns of a pySpark Dataframe?

血红的双手。 submitted on 2021-01-29 17:53:58
Question: I have recently started working with PySpark, so I don't know many details regarding this. I am trying to create a BinaryType column in a data frame, but I am struggling to do it. For example, let's take a simple df:

df.show(2)

+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+

Now I want a third column "col3" with BinaryType, like:

+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+

How should I do it?

Answer 1: …
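One possible sketch (not necessarily what the truncated answer went on to suggest): concatenate the two columns into a single string, substituting the word "null" for missing values, and cast the result to binary, which stores its UTF-8 bytes.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", None), ("2", "20")], ["col1", "col2"])

df3 = df.withColumn(
    "col3",
    F.concat_ws(
        " ",
        F.coalesce(F.col("col1"), F.lit("null")),
        F.coalesce(F.col("col2"), F.lit("null")),
    ).cast("binary"),
)
df3.printSchema()  # col3 appears as binary
df3.show(truncate=False)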

Read CSV file in pyspark with ANSI encoding

荒凉一梦 submitted on 2021-01-29 13:25:54
Question: I am trying to read in a csv/text file that needs to be read using ANSI encoding, but it is not working. Any ideas?

mainDF = spark.read.format("csv")\
    .option("encoding", "ANSI")\
    .option("header", "true")\
    .option("maxRowsInMemory", 1000)\
    .option("inferSchema", "false")\
    .option("delimiter", "¬")\
    .load(path)

java.nio.charset.UnsupportedCharsetException: ANSI

The file is over 5GB, hence the Spark requirement. I have also tried ANSI in lower case.

Answer 1: ISO-8859-1 is the same as ANSI …
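Following the answer above, a sketch that simply swaps the charset name: "ANSI" is not a charset name Java recognises, so an equivalent such as ISO-8859-1 (or windows-1252, which is usually what "ANSI" means on Windows) is passed instead. The spark session and path are reused from the question.

mainDF = (
    spark.read.format("csv")
    .option("encoding", "ISO-8859-1")   # or "windows-1252" for Windows "ANSI" files
    .option("header", "true")
    .option("inferSchema", "false")
    .option("delimiter", "¬")
    .load(path)
)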

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

  NA_counts  min  max
A 5          0    100
B 10         0    120
C 8          1    99
D 2          0    500

TIA

Answer 1: You can calculate each metric separately and then union them all, like this:

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]
nulls_df = df …
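A hedged completion of the approach the answer starts (the rest of it is cut off above, so the union step below is an assumption): compute each metric as a one-row aggregate tagged with a metric name, then union the three frames. The toy dataframe is illustrative; in the answer, df and cols refer to the question's dataframe and column list.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, max, min, sum, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 0, 5, 10), (None, 2, 6, 20)], ["A", "B", "C", "D"])
cols = ["A", "B", "C", "D"]

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]

nulls_df = df.agg(*nulls_cols).withColumn("metric", lit("NA_counts"))
min_df = df.agg(*min_cols).withColumn("metric", lit("min"))
max_df = df.agg(*max_cols).withColumn("metric", lit("max"))

# One row per metric, one column per original column; transpose on the driver
# (e.g. via toPandas) if the per-column layout shown in the question is wanted.
summary = nulls_df.unionByName(min_df).unionByName(max_df)
summary.show()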

Rename Column in Athena

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:27:05
Question: The Athena table "organization" reads data from parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find data for the new column in the parquet files. Please let me know if there are ways to resolve this.

Answer 1: You have to change the schema and point to the new column "fee", but it depends on your situation. If you have two data sets, where in one dataset it is called "cost" and in another dataset it is …