pyspark-sql

What is the correct way to sum different dataframe columns in a list in PySpark?

南笙酒味 submitted on 2020-01-01 11:57:30
Question: I want to sum different columns in a Spark dataframe.

Code:

from pyspark.sql import functions as F

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Why do approaches #2 and #3 not work? I am on Spark
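For context (not part of the original excerpt): Python's built-in sum works here because it simply folds the + operator over Column objects, producing a single Column expression, whereas F.sum is an aggregate function that takes one column rather than a list, and df.select returns a DataFrame, not columns. A minimal sketch of the working pattern, assuming a SparkSession named spark as in the question:

from functools import reduce
from operator import add

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# Built-in sum folds Column + Column into a single Column expression
df = df.withColumn('sum1', sum(df["`%s`" % c] for c in cols))

# The same thing written explicitly with reduce
df = df.withColumn('sum2', reduce(add, [df["`%s`" % c] for c in cols]))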

How to do a mathematical operation with two columns in a dataframe using PySpark

拈花ヽ惹草 submitted on 2020-01-01 05:40:32
Question: I have a dataframe with three columns "x", "y" and "z":

x    y      z
bn   12452  221
mb   14521  330
pl   12563  160
lo   22516  142

I need to create another column derived by the formula m = z / (y + z), so the new data frame should look something like this:

x    y      z    m
bn   12452  221  .01743
mb   14521  330  .02222
pl   12563  160  .01257
lo   22516  142  .00626

Answer 1:

df = sqlContext.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330)], ['x', 'y', 'z'])
df = df.withColumn('m', df['z'] / (df['y'] + df['z']))
df.head(2)
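The same derived column can also be written with pyspark.sql.functions, which avoids repeating the DataFrame variable; a small sketch under the same column names:

from pyspark.sql import functions as F

df = df.withColumn('m', F.col('z') / (F.col('y') + F.col('z')))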

pyspark approxQuantile function

為{幸葍}努か submitted on 2020-01-01 03:10:50
Question: I have a dataframe with the columns id, price and timestamp. I would like to find the median value grouped by id. I am using this code to find it, but it's giving me an error:

from pyspark.sql import DataFrameStatFunctions as statFunc

windowSpec = Window.partitionBy("id")
median = statFunc.approxQuantile("price", [0.5], 0) \
    .over(windowSpec)
return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column? TypeError: unbound method
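Not from the excerpt, but for context: approxQuantile is a DataFrame method (reached via df.stat or df.approxQuantile), not a column expression, so it cannot be applied over a window. A hedged sketch of one way to get a per-id median instead, using the SQL percentile_approx aggregate (available in reasonably recent Spark versions) and a join back onto the original rows:

from pyspark.sql import functions as F

# One approximate median per id
medians = df.groupBy("id").agg(F.expr("percentile_approx(price, 0.5)").alias("Median"))

# Attach it to every row of the original DataFrame
df_with_median = df.join(medians, on="id", how="left")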

How to conditionally replace a value in a column based on an expression evaluated on another column in PySpark?

こ雲淡風輕ζ submitted on 2019-12-31 10:18:21
Question:

import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (0, 5, float(10)), (1, 6, float('nan')), (0, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1|null|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4|null|
|      0|         5|10.0|
|      1|         6| NaN|
|      0|         6| NaN|
+-------+----------+----+

How to replace the value of the timestamp1 column with 999 when session == 0? Expected output:
+---
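A minimal sketch of the usual pattern for this kind of conditional replacement (not taken from the excerpt), using when/otherwise:

from pyspark.sql import functions as F

# Set timestamp1 to 999 on rows where session == 0, leave it unchanged otherwise
df = df.withColumn(
    "timestamp1",
    F.when(F.col("session") == 0, F.lit(999)).otherwise(F.col("timestamp1"))
)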

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

耗尽温柔 submitted on 2019-12-30 18:55:34
Question: Using Spark 2.0, I am trying to run a simple VectorAssembler in a PySpark ML pipeline, like so:

feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                    outputCol="features")
pipeline = Pipeline(stages=[feature_assembler])
model = pipeline.fit(df_train)
model_output = model.transform(df_train)

When I try to look at the output using model_output.select("features").show(1) I get the error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-95-7a3e3d4f281c> in
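The excerpt cuts off before any answer. A commonly reported workaround for the "grows beyond 64 KB" code-generation error (an assumption here, not something stated in this excerpt) is to disable whole-stage code generation, or to checkpoint the training DataFrame so a very long query plan is truncated before the pipeline runs:

# Workaround: turn off whole-stage code generation (may cost some performance)
spark.conf.set("spark.sql.codegen.wholeStage", "false")

# Alternative (Spark 2.1+): truncate the lineage before fitting
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
df_train = df_train.checkpoint()
model = pipeline.fit(df_train)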

Pyspark replace NaN with NULL

前提是你 submitted on 2019-12-30 11:31:28
Question: I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL. I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0
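One common way to do this (a sketch, not taken from the excerpt) is to build the NULL explicitly with when/isnan, since na.fill only accepts non-null replacement values:

from pyspark.sql import functions as F

# Replace NaN with NULL in every float/double column
numeric_cols = [c for c, t in some_table.dtypes if t in ('float', 'double')]
for c in numeric_cols:
    some_table = some_table.withColumn(
        c, F.when(F.isnan(F.col(c)), F.lit(None)).otherwise(F.col(c))
    )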

Pyspark DataFrame UDF on Text Column

帅比萌擦擦* submitted on 2019-12-30 01:58:04
Question: I'm trying to do some NLP text clean-up of some Unicode columns in a PySpark DataFrame. I've tried Spark 1.3, 1.5 and 1.6 and can't seem to get things to work for the life of me. I've also tried using Python 2.7 and Python 3.4. I've created an extremely simple UDF, seen below, that should just return a string back for each record in a new column. Other functions will manipulate the text and then return the changed text back in a new column.

import pyspark
from pyspark.sql import
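The excerpt cuts off at the imports. A minimal, self-contained sketch of the kind of string-returning UDF described (the column and function names below are illustrative, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF that returns a cleaned copy of the input string, passing nulls through
clean_text = F.udf(lambda s: s.strip().lower() if s is not None else None, StringType())

df = df.withColumn("text_clean", clean_text(F.col("text")))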

How to convert type Row into Vector to feed to KMeans

泄露秘密 submitted on 2019-12-30 00:39:49
Question: When I try to feed df2 to KMeans

clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random")

I get the following error:

Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a dataframe created as follows:

df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude', 'longitude')
df2.show()

  latitude|  longitude|
60.1643075| 24.9460844|
60.4686748| 22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans
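KMeans.train from pyspark.mllib expects an RDD of vectors (or numeric sequences), not DataFrame rows; a sketch of the usual conversion, assuming the two columns shown above:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# Map each Row to a dense vector of its numeric fields
points = df2.rdd.map(lambda row: Vectors.dense(row.latitude, row.longitude))

clusters = KMeans.train(points, 10, maxIterations=30, initializationMode="random")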

Pyspark convert a standard list to data frame [duplicate]

走远了吗. submitted on 2019-12-30 00:35:49
Question: This question already has an answer here: Create Spark DataFrame. Can not infer schema for type: <type 'float'> (1 answer). Closed last year.

The case is really simple: I need to convert a Python list into a data frame with the following code

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df =
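The excerpt cuts off at the createDataFrame call. The usual fix for the "can not infer schema" error is to wrap each scalar in a one-element tuple (or Row) so every list item becomes a one-field record matching the schema; a sketch under that assumption, using a SparkSession named spark (the original may use sqlContext instead):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# Wrap each int in a tuple so it matches the single-field schema
rdd = sc.parallelize(my_list).map(lambda x: (x,))
df = spark.createDataFrame(rdd, schema)

# Or build it directly without an RDD
df = spark.createDataFrame([(x,) for x in my_list], schema)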

Filter PySpark DataFrame by checking if string appears in column

此生再无相见时 submitted on 2019-12-29 09:03:57
Question: I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a json file. A part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. So whether this author is the first author listed in authors or the nth, the row should be included if their name appears. So something along
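The excerpt stops mid-sentence; the standard tool for this kind of membership test on an array column is array_contains. A minimal sketch, with the author name as a placeholder:

from pyspark.sql import functions as F

# Keep rows whose authors array contains the given name, wherever it appears in the list
filtered = df.filter(F.array_contains(F.col("authors"), "Some Author"))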