user-defined-functions

Implicit schema for pandas_udf in PySpark?

Submitted by ◇◆丶佛笑我妖孽 on 2021-01-27 17:31:09
Question: This answer nicely explains how to use pyspark's groupby and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually, as shown in this part of the example:

    from pyspark.sql.types import *

    schema = StructType([
        StructField("key", StringType()),
        StructField("avg_min", DoubleType())
    ])

since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function and …
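One common workaround (not from the excerpt above; the helper name agg_func and the sampling step are illustrative) is to run the pandas function once on a small sample on the driver, let Spark infer a schema from the result, and pass that inferred schema to the real groupBy/applyInPandas call. A minimal sketch, assuming Spark 3.x:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def agg_func(pdf: pd.DataFrame) -> pd.DataFrame:
        # Stand-in for the real aggregation that produces 100+ generated columns.
        out = pdf.groupby("key").mean(numeric_only=True).reset_index()
        out.columns = ["key"] + [f"avg_{c}" for c in out.columns[1:]]
        return out

    # Run the function on a small pandas sample and let Spark infer the schema.
    sample = agg_func(df.limit(100).toPandas())
    inferred_schema = spark.createDataFrame(sample).schema

    # Reuse the inferred schema for the full job.
    result = df.groupBy("key").applyInPandas(agg_func, schema=inferred_schema)

The price of this trick is one extra small execution of the function on the driver; the schema itself is never written out by hand.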

How to determine the number of days in a month in SQL Server?

Submitted by 此生再无相见时 on 2021-01-20 13:41:52
Question: I need to determine the number of days in a month for a given date in SQL Server. Is there a built-in function? If not, what should I use as the user-defined function?

Answer 1: You can use the following with the first day of the specified month:

    datediff(day, @date, dateadd(month, 1, @date))

To make it work for every date:

    datediff(day, dateadd(day, 1 - day(@date), @date), dateadd(month, 1, dateadd(day, 1 - day(@date), @date)))

Answer 2: In SQL Server 2012 you can use EOMONTH (Transact-SQL) to get the …
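The second expression works by snapping @date back to the first of its month (dateadd(day, 1 - day(@date), @date)) and then diffing against the same point one month later. As a cross-check of that arithmetic (mine, not part of the original answers), the same logic in Python:

    from datetime import date

    def days_in_month(d: date) -> int:
        # Snap to the 1st of the month, add one month, take the difference,
        # mirroring the datediff/dateadd trick from Answer 1.
        first = d.replace(day=1)
        if first.month == 12:
            next_first = first.replace(year=first.year + 1, month=1)
        else:
            next_first = first.replace(month=first.month + 1)
        return (next_first - first).days

    assert days_in_month(date(2020, 2, 15)) == 29  # leap-year February
    assert days_in_month(date(2021, 2, 15)) == 28
    assert days_in_month(date(2021, 1, 31)) == 31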

How to calculate difference between dates excluding weekends in Pyspark 2.2.0

Submitted by |▌冷眼眸甩不掉的悲伤 on 2021-01-07 06:50:49
Question: I have the below pyspark df, which can be recreated by this code:

    df = spark.createDataFrame(
        [(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")],
        ("id", "name", "date"))

    +---+--------+----------+
    | id|    name|      date|
    +---+--------+----------+
    |  1|John Doe|2020-11-30|
    |  2|John Doe|2020-11-27|
    |  3|John Doe|2020-11-29|
    +---+--------+----------+

I am looking to create a udf to calculate the difference between 2 rows of dates (using the lag function), excluding weekends, as …
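One possible shape for this (my sketch, not the thread's accepted answer) is a plain Python UDF around numpy.busday_count combined with lag over a window. Note that numpy.busday_count counts Mon-Fri days and excludes the end date, and that pandas_udf is not available before Spark 2.3, so a row-wise UDF is used here:

    import numpy as np
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType
    from pyspark.sql.window import Window

    @F.udf(IntegerType())
    def busday_diff(start, end):
        # Business days (Mon-Fri) between two ISO date strings, end exclusive.
        if start is None or end is None:
            return None
        return int(np.busday_count(start, end))

    w = Window.partitionBy("name").orderBy("date")
    result = (df
              .withColumn("prev_date", F.lag("date").over(w))
              .withColumn("weekday_diff", busday_diff("prev_date", "date")))
    result.show()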

PySpark string syntax error on UDF that returns MapType(StringType(), StringType())

Submitted by 假如想象 on 2021-01-07 01:29:08
Question: I'm getting the following syntax error:

    pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;

when performing some aspect-sentiment classification on the text column of a Spark dataframe df_text that looks more or less like the following:

    index  id       text
    1995   ev0oyrq  [sign up](
    2014   eugwxff  No I am not.
    2675   g9f914q  It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.
    1310   echja0g  Thank you!
    2727   gc725t2  …
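The message hints that a row's text value ("No I am not.") is being parsed as a column name. A common cause (my guess; the excerpt cuts off before the code) is passing a plain Python string to a UDF call, which PySpark resolves as a column reference. A minimal sketch of the failure mode and the usual fix:

    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    @F.udf(MapType(StringType(), StringType()))
    def classify(text):
        # Stand-in for the real aspect-sentiment classifier.
        return {"sentiment": "neutral"} if text else None

    # Fails: the string is parsed as a column name, raising
    # "syntax error in attribute name: No I am not."
    # df_text.withColumn("aspects", classify("No I am not."))

    # Works: reference the column explicitly...
    with_col = df_text.withColumn("aspects", classify(F.col("text")))
    # ...or wrap a literal string in F.lit() when a constant is really meant.
    with_lit = df_text.withColumn("aspects", classify(F.lit("No I am not.")))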

How to correctly transform spark dataframe by mapInPandas

Submitted by 随声附和 on 2021-01-06 03:51:57
Question: I'm trying to transform a spark dataframe with 10k rows using the latest Spark 3.0.1 function mapInPandas. Expected output: the mapped pandas_function() transforms one row into three, so the output transformed_df should have 30k rows. Current output: I'm getting 3 rows with 1 core and 24 rows with 8 cores.

Input: respond_sdf has 10k rows:

    +-----+-------------------------------------------------------------------+
    |url  |content                                                            |
    +-----+-------------------------------------------------------------------+
    |api_1|{…
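The output scaling with the core count (3 rows on 1 partition, 3 × 8 = 24 rows on 8) suggests the function returns one small frame per partition instead of mapping every batch. mapInPandas hands the function an iterator of pandas DataFrames and expects one yielded result per batch, so the loop over the iterator is essential. A sketch of the correct shape, using a hypothetical one-row-to-three transform:

    import pandas as pd

    def pandas_function(iterator):
        # mapInPandas passes an ITERATOR of pandas DataFrames (one per Arrow
        # batch); loop over all of them rather than returning a single frame.
        for pdf in iterator:
            # Hypothetical transform: each input row becomes three output rows.
            yield pdf.loc[pdf.index.repeat(3)].reset_index(drop=True)

    transformed_df = respond_sdf.mapInPandas(pandas_function, schema=respond_sdf.schema)
    # transformed_df.count() should now be 3 * respond_sdf.count()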

How to yield pandas dataframe rows to spark dataframe

Submitted by 泄露秘密 on 2021-01-01 08:10:36
Question: Hi, I'm writing a transformation. I have created a generator some_function(iter) that yields Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn transformed rows from a pandas dataframe into an RDD and then into a spark dataframe, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.)

Input Spark DataFrame:

    respond_sdf.show()
    +-------------------------------------------------------------------+
    |content                                                            |
    +----------------------------------------------------------…
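A sketch (mine; the column names follow the excerpt) of one way to drive such a generator: materialize the pandas rows into Row objects, parallelize them, and let createDataFrame build the spark dataframe. Casting numpy scalars to plain Python types avoids a frequent "not supported type" error:

    import pandas as pd
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.DataFrame({"api": ["api_1", "api_2"], "A": [1, 2], "B": [3, 4]})

    def some_function(pdf_in):
        # Yield one Row per pandas row, keyed by the pandas index.
        for index, row in pdf_in.iterrows():
            yield Row(id=int(index), api=row["api"],
                      A=int(row["A"]), B=int(row["B"]))

    rdd = spark.sparkContext.parallelize(list(some_function(pdf)))
    sdf = spark.createDataFrame(rdd)
    sdf.show()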

Using tensorflow.keras model in pyspark UDF generates a pickle error

Submitted by 最后都变了- on 2021-01-01 07:02:47
Question: I would like to use a tensorflow.keras model in a pyspark pandas_udf. However, I get a pickle error when the model is serialized before being sent to the workers. I am not sure I am using the best method to achieve what I want, so I will present a minimal but complete example.

Packages:

    tensorflow-2.2.0 (but the error is triggered by all previous versions too)
    pyspark-2.4.5

The import statements are:

    import pandas as pd
    import numpy as np
    from tensorflow.keras.models import Sequential
    …
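A common workaround (a standard pattern, not necessarily the thread's accepted answer) is to keep the unpicklable model out of the UDF closure: broadcast only the picklable weight arrays and rebuild the model inside the UDF on the worker. A sketch in the pyspark 2.4-style pandas_udf API, with a hypothetical build_model() helper:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Sequential

    def build_model():
        # Hypothetical stand-in for the real architecture.
        return Sequential([Dense(1, input_shape=(1,))])

    # Driver: broadcast only the (picklable) weight arrays, never the model.
    bc_weights = spark.sparkContext.broadcast(build_model().get_weights())

    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def predict_udf(x):
        # Worker: rebuild the model locally and load the broadcast weights.
        model = build_model()
        model.set_weights(bc_weights.value)
        preds = model.predict(x.to_numpy().reshape(-1, 1))
        return pd.Series(preds.ravel().astype(float))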

Change Font of the return value of a UDF in VBA using the Range.Characters property

Submitted by 删除回忆录丶 on 2020-12-27 06:32:45
Question: I've written a user-defined function and want to change the font format of a defined character range of the return value. It doesn't seem to work the way I expect for cells containing formulas ("= …"). I only get two scenarios: the first formats the whole return value, and the second doesn't format anything. For "normal" cells it works, as you can see in the screenshot (trying to change the font format of the first character to purple: top cell with a formula, bottom cell without). Does anyone have an idea how to do …