How to solve pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing spark's memory?
Question: I'm running a job in PySpark where at one point I use a grouped aggregate Pandas UDF. This results in the following (abbreviated here) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer I'm fairly sure this is because one of the groups the pandas UDF receives is huge; if I reduce the dataset and remove enough rows, I can run my UDF with no problems. However, I want to run with my original dataset, and even if I run this spark job on a machine with
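One detail worth knowing here: this exception usually comes from Arrow's per-buffer allocation limit being hit by a single oversized group, so adding executor memory alone may not help. A common workaround (not from the original question, and shown below only as a sketch using plain pandas to mimic the grouping) is to "salt" the group key so one huge group is split into several smaller sub-groups; the column names `key`, `value`, and `salt` are illustrative:

```python
import pandas as pd

# Illustrative data: group "a" stands in for the single huge group
# that overwhelms the pandas UDF.
df = pd.DataFrame({
    "key": ["a"] * 10 + ["b"] * 2,
    "value": range(12),
})

# Add a salt column so each original group is split into at most
# n_salts sub-groups, capping the size of any one group.
n_salts = 4
df["salt"] = [i % n_salts for i in range(len(df))]

# Grouping on (key, salt) instead of key alone bounds group size.
sizes = df.groupby(["key", "salt"]).size()
print(sizes.max())  # largest sub-group after salting
```

In PySpark the same idea would be expressed with something like `df.withColumn("salt", ...)` using a hash or random value, then grouping on both columns and aggregating the partial results afterwards; the exact aggregation depends on whether your UDF's result can be combined across sub-groups.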