pyspark-dataframes

Convert string list to binary list in pyspark

Submitted by 孤人 on 2020-03-22 06:28:58
Question: I have a dataframe like this:

data = [(("ID1", ['October', 'September', 'August'])),
        (("ID2", ['August', 'June', 'May'])),
        (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May]         |
|ID3|[October, June]             |
+---+----------------------------+

I want to compare every row with a default list, such that
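
The excerpt is cut off, so the following is only a minimal sketch of one common approach, assuming a default month list and an output column called BinaryList (both are assumptions): check each default month with array_contains and collect the 0/1 flags into an array.

from pyspark.sql import functions as F

# Assumed default list; the real one is not visible in the truncated question.
default_months = ['October', 'September', 'August', 'July', 'June', 'May']

df_binary = df.withColumn(
    "BinaryList",
    F.array(*[F.when(F.array_contains("MonthList", m), 1).otherwise(0)
              for m in default_months])
)
df_binary.show(truncate=False)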

Compare two datasets in pyspark

Submitted by ぐ巨炮叔叔 on 2020-03-04 15:34:23
Question: I have 2 datasets.

Example dataset 1:

id   | model | first_name | last_name
-----+-------+------------+----------------
1234 | 32    | 456765     | [456700,987565]
4539 | 20    | 123211     | [893456,123456]

Sometimes one of the columns first_name and last_name is empty.

Example dataset 2:

number | matricule | name | model
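
The excerpt stops before the matching rules are stated, so only a generic sketch is possible: project both datasets onto a common schema, then diff them row-wise. The number -> id column mapping below is an assumption.

ds1 = dataset1.select("id", "model")
ds2 = dataset2.selectExpr("number as id", "model")

only_in_ds1 = ds1.exceptAll(ds2)   # rows of dataset 1 missing from dataset 2 (Spark >= 2.4)
only_in_ds2 = ds2.exceptAll(ds1)   # rows of dataset 2 missing from dataset 1
in_both = ds1.intersect(ds2)       # rows present in both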

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45
Question: I want to compute a bunch of different agg functions on different columns in a dataframe. I know I can do something like this, but the output is all one row:

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output dataframe to look like this:

     | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...
cola | 1    | 10   | ...
colb | 2    | NULL | ...
colc | 5    | 20   | ...
cold | NULL | 42   | ...

Where my
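
A minimal sketch of one way to get that shape, assuming the source dataframe is df: run the aggregations per column as a one-row select, then union the rows. Only Min and Max are shown; the aggs list would be extended for the other functions.

from functools import reduce
from pyspark.sql import functions as F

cols = ["cola", "colb", "colc", "cold"]   # the 10 columns in practice
aggs = [("Min", F.min), ("Max", F.max)]   # extend with the other agg functions

per_column = [
    df.select(F.lit(c).alias("column"),
              *[fn(c).alias(name) for name, fn in aggs])
    for c in cols
]
result = reduce(lambda a, b: a.unionByName(b), per_column)
result.show()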

How to pass an array column and convert it to a numpy array in pyspark

Submitted by 人走茶凉 on 2020-01-30 10:32:14
Question: I have a data frame like below:

from pyspark import SparkContext, SparkConf, SQLContext
import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import lit, countDistinct, udf, array, struct
import pyspark.sql.functions as F

config = SparkConf("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)

@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)

df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2"
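
A minimal sketch of how the UDF is usually wired up, assuming the truncated createDataFrame call produces numeric columns named x1, x2, x3 (those names are assumptions): pass them to the UDF as a single array column and cast the numpy result back to a plain Python float.

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf("float")
def my_cosine(x):
    y = np.array([1, 3, 9])
    # cast numpy.float64 to a plain Python float so Spark can serialize it
    return float(cosine(np.array(x), y))

df = sqlContext.createDataFrame(
    [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0)],
    ["id", "x1", "x2", "x3"],
)
df.withColumn("distance", my_cosine(F.array("x1", "x2", "x3"))).show()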

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a pyspark dataframe:

id | column
---+------------------------
1  | [0.2, 2, 3, 4, 3, 0.5]
2  | [7, 0.3, 0.3, 8, 2,]

I would like to create 3 columns:

Column 1: contains the sum of the elements < 2
Column 2: contains the sum of the elements > 2
Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them)

If there is no matching value, I put null. Expected result:

id | column | column
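
A minimal sketch of one approach (the output column names are assumptions): explode the array, sum under each condition with when(), and join back so the original column is kept. sum() over no matching rows yields NULL, which matches the "put null" requirement.

from pyspark.sql import functions as F

sums = (
    df.select("id", F.explode("column").alias("v"))
      .groupBy("id")
      .agg(
          F.sum(F.when(F.col("v") < 2, F.col("v"))).alias("sum_lt_2"),
          F.sum(F.when(F.col("v") > 2, F.col("v"))).alias("sum_gt_2"),
          F.sum(F.when(F.col("v") == 2, F.col("v"))).alias("sum_eq_2"),
      )
)
result = df.join(sums, "id")   # keep the original array column alongside the sums
result.show(truncate=False)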

py4j.protocol.Py4JJavaError: An error occurred while calling o788.save. : com.mongodb.MongoTimeoutException, WritableServerSelector

Submitted by 偶尔善良 on 2020-01-25 08:59:26
Question:

PySpark version: 2.4.4
MongoDB version: 4.2.0
RAM: 64 GB
CPU cores: 32

Running script:

spark-submit --executor-memory 8G --driver-memory 8G --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 demographic.py

When I run the code I get the error:

"py4j.protocol.Py4JJavaError: An error occurred while calling o764.save. : com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type
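
This error class usually means the driver cannot reach a writable (primary) member of the MongoDB deployment at the host names it was given, so a common first step is checking that the hosts in the connection URI are resolvable from the Spark driver and executors. Below is a minimal sketch of how the write is typically configured with mongo-spark-connector 2.x; the URI, database, and collection names are placeholders, not values from the question.

df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://user:password@mongo-host:27017") \
    .option("database", "mydb") \
    .option("collection", "demographic") \
    .mode("append") \
    .save()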

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 懵懂的女人 on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in pyspark and without breeze. For example, if my pyspark dataframe looks like this:

user | weight | vec
"u1" | 0.1    | [2, 4, 6]
"u1" | 0.5    | [4, 8, 12]
"u2" | 0.5    | [20, 40, 60]

where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this:

user | wsum
"u1" | [2.2, 4.4, 6.6]
"u2" | [10, 20, 30]

To do this I have
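
A minimal sketch of one way to do this without breeze: explode each vector with its position, sum weight * value per position, and reassemble the array in position order. The intermediate column names are assumptions.

from pyspark.sql import functions as F

wsum = (
    df.select("user", "weight", F.posexplode("vec").alias("pos", "val"))
      .groupBy("user", "pos")
      .agg(F.sum(F.col("weight") * F.col("val")).alias("w"))
      .groupBy("user")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "w"))).alias("tmp"))
      .select("user", F.col("tmp.w").alias("wsum"))
)
wsum.show(truncate=False)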