pyspark-sql

py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement

Submitted by 孤者浪人 on 2019-12-10 23:19:10
Question: I'm trying to perform a simple task on a Spark DataFrame (Python): create a new DataFrame by selecting specific columns and nested columns from another DataFrame. For example:

df.printSchema()
root
 |-- time_stamp: long (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- order: struct (nullable = true)
 |    |-- created_at: string
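
A minimal sketch of one common approach (not from the original question), assuming a schema like the one above: nested struct fields can be selected with dot notation and aliased to flat column names.

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring part of the schema shown in the question.
df = spark.createDataFrame([
    Row(time_stamp=1, country=Row(code="US", id=1, time_zone="UTC"), event_name="click"),
])

# Dot notation reaches into a struct; alias() flattens the resulting column name.
flat = df.select(
    "time_stamp",
    col("country.code").alias("country_code"),
    col("country.time_zone").alias("country_time_zone"),
)
flat.show()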

Converting RDD to Contingency Table: Pyspark

Submitted by 廉价感情. on 2019-12-10 18:52:55
Question: Currently I am trying to convert an RDD to a contingency table in order to use the pyspark.ml.clustering.KMeans module, which takes a DataFrame as input. When I do myrdd.take(K) (where K is some number), the structure looks as follows:

[[u'user1', ('itm1', 3), ..., ('itm2', 1)],
 [u'user2', ('itm1', 7), ..., ('itm2', 4)],
 ...,
 [u'usern', ('itm2', 2), ..., ('itm3', 10)]]

Each list contains an entity as the first element, followed by the set of all items liked by this entity and their counts, in the form
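
A hedged sketch of one way to approach this (not taken from the question): flatten each record into (user, item, count) rows and pivot the items into columns, which yields a contingency-table-shaped DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD with the structure described above.
myrdd = sc.parallelize([
    ["user1", ("itm1", 3), ("itm2", 1)],
    ["user2", ("itm1", 7), ("itm2", 4)],
])

# One (user, item, count) triple per liked item.
triples = myrdd.flatMap(lambda rec: [(rec[0], item, cnt) for item, cnt in rec[1:]])
df = triples.toDF(["user", "item", "count"])

# Pivot items into columns; combinations that never occur become 0.
table = df.groupBy("user").pivot("item").sum("count").na.fill(0)
table.show()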

How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?

Submitted by 别等时光非礼了梦想. on 2019-12-10 18:23:50
Question: I'm using a PySpark DataFrame. I have some code in which I'm trying to convert the DataFrame to an RDD, but I receive the following error:

AttributeError: 'SparkSession' object has no attribute 'serializer'

What can be the issue?

training, test = rescaledData.randomSplit([0.8, 0.2])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
# Train a naive Bayes model.
model = nb.fit(rescaledData)
# Make prediction and test accuracy.
predictionAndLabel = test.rdd.map(lambda p: (model.predict(p
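
A hedged guess at one frequent cause (the truncated post does not show how the session objects were created): this AttributeError often appears when a SparkSession is passed where a SparkContext is expected, for example when constructing an SQLContext. A sketch under that assumption:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()

# sqlContext = SQLContext(spark)                      # passing the session here can raise
#                                                     # "'SparkSession' object has no attribute 'serializer'"
sqlContext = SQLContext(spark.sparkContext, spark)    # pass the SparkContext instead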

Python Spark DataFrame: replace null with SparseVector

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-10 18:19:48
Question: In Spark, I have the following data frame called "df" with some null entries:

+-------+--------------------+--------------------+
|     id|           features1|           features2|
+-------+--------------------+--------------------+
|    185|(5,[0,1,4],[0.1,0...|                null|
|    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
|    225|                null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+

df.features1 and df.features2 are vector columns (nullable). Then I tried to use the following code to fill the null entries with
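
A minimal sketch of one workaround (not from the original post): DataFrame.fillna does not accept vector values, so a UDF that substitutes an all-zero SparseVector of the right dimension for nulls is one option.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType, StructField, StructType
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.getOrCreate()

# Hypothetical data shaped like the table above.
schema = StructType([
    StructField("id", LongType()),
    StructField("features1", VectorUDT()),
    StructField("features2", VectorUDT()),
])
df = spark.createDataFrame([
    (185, SparseVector(5, [0, 1, 4], [0.1, 0.2, 0.3]), None),
    (225, None, SparseVector(10, [1, 3, 5], [0.1, 0.2, 0.3])),
], schema)

def fill_empty_vector(size):
    # Replaces null with an all-zero SparseVector of the given dimension.
    return udf(lambda v: v if v is not None else SparseVector(size, {}), VectorUDT())

filled = (df
          .withColumn("features1", fill_empty_vector(5)(col("features1")))
          .withColumn("features2", fill_empty_vector(10)(col("features2"))))
filled.show(truncate=False)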

how to merge two columns with a condition in pyspark?

Submitted by 荒凉一梦 on 2019-12-10 17:55:37
Question: I was able to merge and sort the values, but I was unable to figure out the condition not to merge when the values are equal.

df = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar", "bar","aaa","foo")], ("k", "K", "v", "V"))
columns = df.columns
k = 0
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        if columns[i].lower() == columns[j].lower():
            k = k + 1
            df = (df.withColumn(columns[i] + str(k), concat(col(columns[i]), lit(","), col(columns[j]))))
newdf = df.select(col("k")
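
A hedged sketch of the conditional part only (not the poster's full loop): when/otherwise concatenates two columns only when their values differ and keeps a single copy when they are equal. The hypothetical example below uses distinct column names (k/k2, v/v2) to sidestep Spark's case-insensitive column resolution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")],
    ("k", "k2", "v", "v2"))

merged = df.select(
    when(col("k") == col("k2"), col("k"))
        .otherwise(concat(col("k"), lit(","), col("k2")))
        .alias("k_merged"),
    when(col("v") == col("v2"), col("v"))
        .otherwise(concat(col("v"), lit(","), col("v2")))
        .alias("v_merged"),
)
merged.show()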

pyspark, Compare two rows in dataframe

Submitted by 亡梦爱人 on 2019-12-10 17:38:27
Question: I'm attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like:

itemid | eventid | timestamp
----------------------------
134    | 30      | 2016-07-02 12:01:40
134    | 32      | 2016-07-02 12:21:23
125    | 30      | 2016-07-02 13:22:56
125    | 32      | 2016-07-02 13:27:07

I've tried mapping a function onto the dataframe to allow for comparing like this (note: I'm trying to get rows with a difference greater than 4 hours):

items = df.limit(10)\
    .orderBy(
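
A hedged sketch of one common alternative to a row-wise map (not from the original question): a window with lag() exposes the previous event's timestamp per itemid, so the gap can be computed and filtered for differences greater than 4 hours.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lag, unix_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (134, 30, "2016-07-02 12:01:40"), (134, 32, "2016-07-02 12:21:23"),
    (125, 30, "2016-07-02 13:22:56"), (125, 32, "2016-07-02 13:27:07"),
], ["itemid", "eventid", "timestamp"])

# Previous timestamp within each itemid, ordered by time.
w = Window.partitionBy("itemid").orderBy("timestamp")
with_gap = df.withColumn(
    "gap_seconds",
    unix_timestamp("timestamp") - unix_timestamp(lag("timestamp").over(w)))

with_gap.filter(col("gap_seconds") > 4 * 3600).show()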

How to derive Percentile using Spark Data frame and GroupBy in python

Submitted by 寵の児 on 2019-12-10 16:32:07
Question: I have a Spark dataframe which has Date, Group and Price columns. I'm trying to derive the 0.6 percentile for the Price column of that dataframe in Python. Besides, I need to add the output as a new column. I tried the code below:

perudf = udf(lambda x: x.quantile(.6))
df1 = df.withColumn("Percentile", df.groupBy("group").agg("group"), perudf('price'))

but it throws the following error:

assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all
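
A hedged sketch of one way this is often done (not from the original post): aggregate with Spark SQL's percentile_approx per group and join the result back on as a new column, rather than calling a pandas-style quantile inside a UDF.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with the columns described in the question.
df = spark.createDataFrame([
    ("2019-01-01", "A", 10.0), ("2019-01-02", "A", 20.0),
    ("2019-01-01", "B", 5.0), ("2019-01-02", "B", 15.0),
], ["date", "group", "price"])

# Approximate 60th percentile of price per group.
pct = df.groupBy("group").agg(
    expr("percentile_approx(price, 0.6)").alias("percentile_60"))

# Attach it back to every row of the original frame as a new column.
df1 = df.join(pct, on="group", how="left")
df1.show()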

How do you add a numpy.array as a new column to a pyspark.SQL DataFrame?

Submitted by 三世轮回 on 2019-12-10 15:25:10
Question: Here is the code to create a pyspark.sql DataFrame:

import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

So that sparkdf looks like:

 a  b  c
 1  2  3
 4  5  6
 7  8  9
10 11 12

Now I would like to add a numpy array (or even a list) as a new column:

new_col = np.array([20,20,20,20])

But the standard way
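
A hedged sketch of one common workaround (not from the original question): Spark cannot attach a driver-side array directly, so both sides can be given a row index with row_number() over monotonically_increasing_id() and then joined.

import numpy as np
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

spark = SparkSession.builder.getOrCreate()

sparkdf = spark.createDataFrame(
    [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)], ["a", "b", "c"])

new_col = np.array([20, 20, 20, 20])

# Index the existing rows in their current order.
w = Window.orderBy(monotonically_increasing_id())
left = sparkdf.withColumn("row_idx", row_number().over(w))

# Build a one-column DataFrame from the array with matching indices.
right = spark.createDataFrame(
    [(i + 1, int(v)) for i, v in enumerate(new_col)], ["row_idx", "newcol"])

result = left.join(right, on="row_idx").drop("row_idx")
result.show()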

What does df.repartition with no column arguments partition on?

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-10 14:47:27
Question: In PySpark the repartition method has an optional columns argument, which will of course repartition your dataframe by that key. My question is: how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning
    expressions. The resulting DataFrame is hash partitioned.

    :param numPartitions: can be an
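
A short illustration, assuming a recent Spark version (not from the original post): repartition(n) with no columns redistributes rows round-robin across n partitions rather than hashing on a key, which can be observed by counting rows per partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
# glom() collects each partition into a list, so len() gives rows per partition.
sizes = df.repartition(8).rdd.glom().map(len).collect()
print(sizes)   # roughly equal counts, independent of any column value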

saving a list of rows to a Hive table in pyspark

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-10 13:45:41
Question: I have a pyspark app. I copied a Hive table to my HDFS directory, and in Python I run a sqlContext.sql query on this table. Now this variable is a DataFrame I call rows. I need to randomly shuffle the rows, so I had to convert them to a list of rows: rows_list = rows.collect(). Then I shuffle(rows_list), which shuffles the list in place. I take the number of random rows I need, x:

for r in range(x):
    allrows2add.append(rows_list[r])

Now I want to save allrows2add as a Hive table OR append an
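
A hedged sketch of one way to finish this (not from the original post; the table names are hypothetical): rebuild a DataFrame from the shuffled list of Rows, reusing the original schema, and append it to a Hive table.

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

rows = spark.sql("SELECT * FROM some_db.some_table")      # hypothetical source table
rows_list = rows.collect()
random.shuffle(rows_list)

x = 100                            # number of random rows to keep
allrows2add = rows_list[:x]

# Rebuild a DataFrame from the sampled Rows and append it to a Hive table.
sample_df = spark.createDataFrame(allrows2add, schema=rows.schema)
sample_df.write.mode("append").saveAsTable("some_db.sample_table")   # hypothetical target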