pyspark-dataframes

Convert string list to binary list in pyspark

Submitted by 孤人 on 2020-03-22 06:28:58
Question: I have a dataframe like this:

data = [(("ID1", ['October', 'September', 'August'])),
        (("ID2", ['August', 'June', 'May'])),
        (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May]         |
|ID3|[October, June]             |
+---+----------------------------+

I want to compare every row with a default list, such that
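
The excerpt is cut off, so the following is only a minimal sketch of one common approach, assuming a default month list and an output column called BinaryList (both are assumptions): check each default month with array_contains and collect the 0/1 flags into an array.

from pyspark.sql import functions as F

# Assumed default list; the real one is not visible in the truncated question.
default_months = ['October', 'September', 'August', 'July', 'June', 'May']

df_binary = df.withColumn(
    "BinaryList",
    F.array(*[F.when(F.array_contains("MonthList", m), 1).otherwise(0)
              for m in default_months])
)
df_binary.show(truncate=False)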

Compare two datasets in pyspark

Submitted by ぐ巨炮叔叔 on 2020-03-04 15:34:23
Question: I have 2 datasets.

Example dataset 1:

id   | model | first_name | last_name
-----+-------+------------+----------------
1234 | 32    | 456765     | [456700,987565]
4539 | 20    | 123211     | [893456,123456]

Sometimes one of the columns first_name and last_name is empty.

Example dataset 2:

number | matricule | name | model
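
The excerpt stops before the matching rules are stated, so only a generic sketch is possible: project both datasets onto a common schema, then diff them row-wise. The number -> id column mapping below is an assumption.

ds1 = dataset1.select("id", "model")
ds2 = dataset2.selectExpr("number as id", "model")

only_in_ds1 = ds1.exceptAll(ds2)   # rows of dataset 1 missing from dataset 2 (Spark >= 2.4)
only_in_ds2 = ds2.exceptAll(ds1)   # rows of dataset 2 missing from dataset 1
in_both = ds1.intersect(ds2)       # rows present in both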

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45
Question: I want to compute a bunch of different agg functions on different columns in a dataframe. I know I can do something like this, but the output is all one row:

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output dataframe to look like this:

     | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...
cola | 1    | 10   | ...
colb | 2    | NULL | ...
colc | 5    | 20   | ...
cold | NULL | 42   | ...

Where my
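
A minimal sketch of one way to get that shape, assuming the source dataframe is df: run the aggregations per column as a one-row select, then union the rows. Only Min and Max are shown; the aggs list would be extended for the other functions.

from functools import reduce
from pyspark.sql import functions as F

cols = ["cola", "colb", "colc", "cold"]   # the 10 columns in practice
aggs = [("Min", F.min), ("Max", F.max)]   # extend with the other agg functions

per_column = [
    df.select(F.lit(c).alias("column"),
              *[fn(c).alias(name) for name, fn in aggs])
    for c in cols
]
result = reduce(lambda a, b: a.unionByName(b), per_column)
result.show()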

How to pass an array column and convert it to a numpy array in pyspark

Submitted by 人走茶凉 on 2020-01-30 10:32:14
Question: I have a data frame like below:

from pyspark import SparkContext, SparkConf, SQLContext
import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import lit, countDistinct, udf, array, struct
import pyspark.sql.functions as F

config = SparkConf("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)

@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)

df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2"
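
A minimal sketch of how the UDF is usually wired up, assuming the truncated createDataFrame call produces numeric columns named x1, x2, x3 (those names are assumptions): pass them to the UDF as a single array column and cast the numpy result back to a plain Python float.

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf("float")
def my_cosine(x):
    y = np.array([1, 3, 9])
    # cast numpy.float64 to a plain Python float so Spark can serialize it
    return float(cosine(np.array(x), y))

df = sqlContext.createDataFrame(
    [("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0)],
    ["id", "x1", "x2", "x3"],
)
df.withColumn("distance", my_cosine(F.array("x1", "x2", "x3"))).show()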

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a pyspark dataframe:

id | column
---+------------------------
1  | [0.2, 2, 3, 4, 3, 0.5]
2  | [7, 0.3, 0.3, 8, 2,]

I would like to create 3 columns:

Column 1: contains the sum of the elements < 2
Column 2: contains the sum of the elements > 2
Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them)

If there is no matching value, I put null. Expected result:

id | column | column
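
A minimal sketch of one approach (the output column names are assumptions): explode the array, sum under each condition with when(), and join back so the original column is kept. sum() over no matching rows yields NULL, which matches the "put null" requirement.

from pyspark.sql import functions as F

sums = (
    df.select("id", F.explode("column").alias("v"))
      .groupBy("id")
      .agg(
          F.sum(F.when(F.col("v") < 2, F.col("v"))).alias("sum_lt_2"),
          F.sum(F.when(F.col("v") > 2, F.col("v"))).alias("sum_gt_2"),
          F.sum(F.when(F.col("v") == 2, F.col("v"))).alias("sum_eq_2"),
      )
)
result = df.join(sums, "id")   # keep the original array column alongside the sums
result.show(truncate=False)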

py4j.protocol.Py4JJavaError: An error occurred while calling o788.save. : com.mongodb.MongoTimeoutException, WritableServerSelector

Submitted by 偶尔善良 on 2020-01-25 08:59:26
Question:

PySpark version: 2.4.4
MongoDB version: 4.2.0
RAM: 64 GB
CPU cores: 32

Running script:

spark-submit --executor-memory 8G --driver-memory 8G --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 demographic.py

When I run the code I get the error:

"py4j.protocol.Py4JJavaError: An error occurred while calling o764.save. : com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type
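
This error class usually means the driver cannot reach a writable (primary) member of the MongoDB deployment at the host names it was given, so a common first step is checking that the hosts in the connection URI are resolvable from the Spark driver and executors. Below is a minimal sketch of how the write is typically configured with mongo-spark-connector 2.x; the URI, database, and collection names are placeholders, not values from the question.

df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://user:password@mongo-host:27017") \
    .option("database", "mydb") \
    .option("collection", "demographic") \
    .mode("append") \
    .save()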

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 懵懂的女人 on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in pyspark and without breeze. For example, if my pyspark dataframe looks like this:

user | weight | vec
"u1" | 0.1    | [2, 4, 6]
"u1" | 0.5    | [4, 8, 12]
"u2" | 0.5    | [20, 40, 60]

where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this:

user | wsum
"u1" | [2.2, 4.4, 6.6]
"u2" | [10, 20, 30]

To do this I have
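
A minimal sketch of one way to do this without breeze: explode each vector with its position, sum weight * value per position, and reassemble the array in position order. The intermediate column names are assumptions.

from pyspark.sql import functions as F

wsum = (
    df.select("user", "weight", F.posexplode("vec").alias("pos", "val"))
      .groupBy("user", "pos")
      .agg(F.sum(F.col("weight") * F.col("val")).alias("w"))
      .groupBy("user")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "w"))).alias("tmp"))
      .select("user", F.col("tmp.w").alias("wsum"))
)
wsum.show(truncate=False)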