pyspark-sql

What is the correct way to sum different dataframe columns in a list in PySpark?

南笙酒味 submitted on 2020-01-01 11:57:30
Question: I want to sum different columns in a Spark dataframe.

Code:

from pyspark.sql import functions as F

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Why do approaches #2 and #3 not work? I am on Spark
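For context (not part of the original excerpt): Python's built-in sum works here because it simply folds the + operator over Column objects, producing a single Column expression, whereas F.sum is an aggregate function that takes one column rather than a list, and df.select returns a DataFrame, not columns. A minimal sketch of the working pattern, assuming a SparkSession named spark as in the question:

from functools import reduce
from operator import add

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# Built-in sum folds Column + Column into a single Column expression
df = df.withColumn('sum1', sum(df["`%s`" % c] for c in cols))

# The same thing written explicitly with reduce
df = df.withColumn('sum2', reduce(add, [df["`%s`" % c] for c in cols]))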

How to do a mathematical operation with two columns in a dataframe using PySpark

拈花ヽ惹草 submitted on 2020-01-01 05:40:32
Question: I have a dataframe with three columns "x", "y" and "z":

x    y      z
bn   12452  221
mb   14521  330
pl   12563  160
lo   22516  142

I need to create another column derived by the formula m = z / (y + z), so the new data frame should look something like this:

x    y      z    m
bn   12452  221  .01743
mb   14521  330  .02222
pl   12563  160  .01257
lo   22516  142  .00626

Answer 1:

df = sqlContext.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330)], ['x', 'y', 'z'])
df = df.withColumn('m', df['z'] / (df['y'] + df['z']))
df.head(2)
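The same derived column can also be written with pyspark.sql.functions, which avoids repeating the DataFrame variable; a small sketch under the same column names:

from pyspark.sql import functions as F

df = df.withColumn('m', F.col('z') / (F.col('y') + F.col('z')))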

pyspark approxQuantile function

為{幸葍}努か submitted on 2020-01-01 03:10:50
Question: I have a dataframe with the columns id, price and timestamp. I would like to find the median value grouped by id. I am using this code to find it, but it's giving me an error:

from pyspark.sql import DataFrameStatFunctions as statFunc

windowSpec = Window.partitionBy("id")
median = statFunc.approxQuantile("price", [0.5], 0) \
    .over(windowSpec)
return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column? TypeError: unbound method
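Not from the excerpt, but for context: approxQuantile is a DataFrame method (reached via df.stat or df.approxQuantile), not a column expression, so it cannot be applied over a window. A hedged sketch of one way to get a per-id median instead, using the SQL percentile_approx aggregate (available in reasonably recent Spark versions) and a join back onto the original rows:

from pyspark.sql import functions as F

# One approximate median per id
medians = df.groupBy("id").agg(F.expr("percentile_approx(price, 0.5)").alias("Median"))

# Attach it to every row of the original DataFrame
df_with_median = df.join(medians, on="id", how="left")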

How to conditionally replace a value in a column based on an expression evaluated on another column in PySpark?

こ雲淡風輕ζ submitted on 2019-12-31 10:18:21
Question:

import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (0, 5, float(10)), (1, 6, float('nan')), (0, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1|null|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4|null|
|      0|         5|10.0|
|      1|         6| NaN|
|      0|         6| NaN|
+-------+----------+----+

How to replace the value of the timestamp1 column with 999 when session == 0? Expected output:
+---
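A minimal sketch of the usual pattern for this kind of conditional replacement (not taken from the excerpt), using when/otherwise:

from pyspark.sql import functions as F

# Set timestamp1 to 999 on rows where session == 0, leave it unchanged otherwise
df = df.withColumn(
    "timestamp1",
    F.when(F.col("session") == 0, F.lit(999)).otherwise(F.col("timestamp1"))
)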

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

耗尽温柔 submitted on 2019-12-30 18:55:34
Question: Using Spark 2.0, I am trying to run a simple VectorAssembler in a PySpark ML pipeline, like so:

feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                    outputCol="features")
pipeline = Pipeline(stages=[feature_assembler])
model = pipeline.fit(df_train)
model_output = model.transform(df_train)

When I try to look at the output using model_output.select("features").show(1) I get the error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-95-7a3e3d4f281c> in
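The excerpt cuts off before any answer. A commonly reported workaround for the "grows beyond 64 KB" code-generation error (an assumption here, not something stated in this excerpt) is to disable whole-stage code generation, or to checkpoint the training DataFrame so a very long query plan is truncated before the pipeline runs:

# Workaround: turn off whole-stage code generation (may cost some performance)
spark.conf.set("spark.sql.codegen.wholeStage", "false")

# Alternative (Spark 2.1+): truncate the lineage before fitting
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
df_train = df_train.checkpoint()
model = pipeline.fit(df_train)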

Pyspark replace NaN with NULL

前提是你 submitted on 2019-12-30 11:31:28
Question: I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL. I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0
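One common way to do this (a sketch, not taken from the excerpt) is to build the NULL explicitly with when/isnan, since na.fill only accepts non-null replacement values:

from pyspark.sql import functions as F

# Replace NaN with NULL in every float/double column
numeric_cols = [c for c, t in some_table.dtypes if t in ('float', 'double')]
for c in numeric_cols:
    some_table = some_table.withColumn(
        c, F.when(F.isnan(F.col(c)), F.lit(None)).otherwise(F.col(c))
    )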

Pyspark DataFrame UDF on Text Column

帅比萌擦擦* submitted on 2019-12-30 01:58:04
Question: I'm trying to do some NLP text clean-up of some Unicode columns in a PySpark DataFrame. I've tried Spark 1.3, 1.5 and 1.6 and can't seem to get things to work for the life of me. I've also tried using Python 2.7 and Python 3.4. I've created an extremely simple UDF, seen below, that should just return a string back for each record in a new column. Other functions will manipulate the text and then return the changed text back in a new column.

import pyspark
from pyspark.sql import
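The excerpt cuts off at the imports. A minimal, self-contained sketch of the kind of string-returning UDF described (the column and function names below are illustrative, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF that returns a cleaned copy of the input string, passing nulls through
clean_text = F.udf(lambda s: s.strip().lower() if s is not None else None, StringType())

df = df.withColumn("text_clean", clean_text(F.col("text")))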

How to convert type Row into Vector to feed to KMeans

泄露秘密 submitted on 2019-12-30 00:39:49
Question: When I try to feed df2 to KMeans

clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random")

I get the following error:

Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a dataframe created as follows:

df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude', 'longitude')
df2.show()

  latitude|  longitude|
60.1643075| 24.9460844|
60.4686748| 22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans
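KMeans.train from pyspark.mllib expects an RDD of vectors (or numeric sequences), not DataFrame rows; a sketch of the usual conversion, assuming the two columns shown above:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# Map each Row to a dense vector of its numeric fields
points = df2.rdd.map(lambda row: Vectors.dense(row.latitude, row.longitude))

clusters = KMeans.train(points, 10, maxIterations=30, initializationMode="random")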

Pyspark convert a standard list to data frame [duplicate]

走远了吗. submitted on 2019-12-30 00:35:49
Question: This question already has an answer here: Create Spark DataFrame. Can not infer schema for type: <type 'float'> (1 answer). Closed last year.

The case is really simple: I need to convert a Python list into a data frame with the following code

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df =
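The excerpt cuts off at the createDataFrame call. The usual fix for the "can not infer schema" error is to wrap each scalar in a one-element tuple (or Row) so every list item becomes a one-field record matching the schema; a sketch under that assumption, using a SparkSession named spark (the original may use sqlContext instead):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# Wrap each int in a tuple so it matches the single-field schema
rdd = sc.parallelize(my_list).map(lambda x: (x,))
df = spark.createDataFrame(rdd, schema)

# Or build it directly without an RDD
df = spark.createDataFrame([(x,) for x in my_list], schema)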

Filter PySpark DataFrame by checking if string appears in column

此生再无相见时 submitted on 2019-12-29 09:03:57
Question: I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a json file. A part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. So whether this author is the first author listed in authors or the nth, the row should be included if their name appears. So something along
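The excerpt stops mid-sentence; the standard tool for this kind of membership test on an array column is array_contains. A minimal sketch, with the author name as a placeholder:

from pyspark.sql import functions as F

# Keep rows whose authors array contains the given name, wherever it appears in the list
filtered = df.filter(F.array_contains(F.col("authors"), "Some Author"))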