pyspark-sql

py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement

Submitted by 孤者浪人 on 2019-12-10 23:19:10
Question: I'm trying to perform a simple task on a Spark DataFrame (Python): create a new DataFrame by selecting specific columns and nested columns from another DataFrame. For example:

df.printSchema()
root
 |-- time_stamp: long (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- order: struct (nullable = true)
 |    |-- created_at: string
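
A minimal sketch of one common approach (not from the original question), assuming a schema like the one above: nested struct fields can be selected with dot notation and aliased to flat column names.

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring part of the schema shown in the question.
df = spark.createDataFrame([
    Row(time_stamp=1, country=Row(code="US", id=1, time_zone="UTC"), event_name="click"),
])

# Dot notation reaches into a struct; alias() flattens the resulting column name.
flat = df.select(
    "time_stamp",
    col("country.code").alias("country_code"),
    col("country.time_zone").alias("country_time_zone"),
)
flat.show()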

Converting RDD to Contingency Table: Pyspark

Submitted by 廉价感情. on 2019-12-10 18:52:55
Question: Currently I am trying to convert an RDD to a contingency table in order to use the pyspark.ml.clustering.KMeans module, which takes a DataFrame as input. When I do myrdd.take(K) (where K is some number), the structure looks as follows:

[[u'user1', ('itm1', 3), ..., ('itm2', 1)],
 [u'user2', ('itm1', 7), ..., ('itm2', 4)],
 ...,
 [u'usern', ('itm2', 2), ..., ('itm3', 10)]]

Each list contains an entity as the first element, followed by the set of all items liked by this entity and their counts, in the form
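
A hedged sketch of one way to approach this (not taken from the question): flatten each record into (user, item, count) rows and pivot the items into columns, which yields a contingency-table-shaped DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD with the structure described above.
myrdd = sc.parallelize([
    ["user1", ("itm1", 3), ("itm2", 1)],
    ["user2", ("itm1", 7), ("itm2", 4)],
])

# One (user, item, count) triple per liked item.
triples = myrdd.flatMap(lambda rec: [(rec[0], item, cnt) for item, cnt in rec[1:]])
df = triples.toDF(["user", "item", "count"])

# Pivot items into columns; combinations that never occur become 0.
table = df.groupBy("user").pivot("item").sum("count").na.fill(0)
table.show()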

How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?

Submitted by 别等时光非礼了梦想. on 2019-12-10 18:23:50
Question: I'm using a PySpark DataFrame. I have some code in which I'm trying to convert the DataFrame to an RDD, but I receive the following error:

AttributeError: 'SparkSession' object has no attribute 'serializer'

What can be the issue?

training, test = rescaledData.randomSplit([0.8, 0.2])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
# Train a naive Bayes model.
model = nb.fit(rescaledData)
# Make prediction and test accuracy.
predictionAndLabel = test.rdd.map(lambda p: (model.predict(p
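
A hedged guess at one frequent cause (the truncated post does not show how the session objects were created): this AttributeError often appears when a SparkSession is passed where a SparkContext is expected, for example when constructing an SQLContext. A sketch under that assumption:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()

# sqlContext = SQLContext(spark)                      # passing the session here can raise
#                                                     # "'SparkSession' object has no attribute 'serializer'"
sqlContext = SQLContext(spark.sparkContext, spark)    # pass the SparkContext instead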

Python Spark DataFrame: replace null with SparseVector

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-10 18:19:48
Question: In Spark, I have the following data frame called "df" with some null entries:

+-------+--------------------+--------------------+
|     id|           features1|           features2|
+-------+--------------------+--------------------+
|    185|(5,[0,1,4],[0.1,0...|                null|
|    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
|    225|                null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+

df.features1 and df.features2 are vector columns (nullable). Then I tried to use the following code to fill the null entries with
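
A minimal sketch of one workaround (not from the original post): DataFrame.fillna does not accept vector values, so a UDF that substitutes an all-zero SparseVector of the right dimension for nulls is one option.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType, StructField, StructType
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.getOrCreate()

# Hypothetical data shaped like the table above.
schema = StructType([
    StructField("id", LongType()),
    StructField("features1", VectorUDT()),
    StructField("features2", VectorUDT()),
])
df = spark.createDataFrame([
    (185, SparseVector(5, [0, 1, 4], [0.1, 0.2, 0.3]), None),
    (225, None, SparseVector(10, [1, 3, 5], [0.1, 0.2, 0.3])),
], schema)

def fill_empty_vector(size):
    # Replaces null with an all-zero SparseVector of the given dimension.
    return udf(lambda v: v if v is not None else SparseVector(size, {}), VectorUDT())

filled = (df
          .withColumn("features1", fill_empty_vector(5)(col("features1")))
          .withColumn("features2", fill_empty_vector(10)(col("features2"))))
filled.show(truncate=False)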

how to merge two columns with a condition in pyspark?

Submitted by 荒凉一梦 on 2019-12-10 17:55:37
Question: I was able to merge and sort the values, but I was unable to figure out the condition not to merge when the values are equal.

df = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar", "bar","aaa","foo")], ("k", "K", "v", "V"))
columns = df.columns
k = 0
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        if columns[i].lower() == columns[j].lower():
            k = k + 1
            df = (df.withColumn(columns[i] + str(k), concat(col(columns[i]), lit(","), col(columns[j]))))
newdf = df.select(col("k")
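
A hedged sketch of the conditional part only (not the poster's full loop): when/otherwise concatenates two columns only when their values differ and keeps a single copy when they are equal. The hypothetical example below uses distinct column names (k/k2, v/v2) to sidestep Spark's case-insensitive column resolution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")],
    ("k", "k2", "v", "v2"))

merged = df.select(
    when(col("k") == col("k2"), col("k"))
        .otherwise(concat(col("k"), lit(","), col("k2")))
        .alias("k_merged"),
    when(col("v") == col("v2"), col("v"))
        .otherwise(concat(col("v"), lit(","), col("v2")))
        .alias("v_merged"),
)
merged.show()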

pyspark, Compare two rows in dataframe

Submitted by 亡梦爱人 on 2019-12-10 17:38:27
Question: I'm attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like:

itemid | eventid | timestamp
----------------------------
134    | 30      | 2016-07-02 12:01:40
134    | 32      | 2016-07-02 12:21:23
125    | 30      | 2016-07-02 13:22:56
125    | 32      | 2016-07-02 13:27:07

I've tried mapping a function onto the dataframe to allow for comparing like this (note: I'm trying to get rows with a difference greater than 4 hours):

items = df.limit(10)\
    .orderBy(
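
A hedged sketch of one common alternative to a row-wise map (not from the original question): a window with lag() exposes the previous event's timestamp per itemid, so the gap can be computed and filtered for differences greater than 4 hours.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lag, unix_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (134, 30, "2016-07-02 12:01:40"), (134, 32, "2016-07-02 12:21:23"),
    (125, 30, "2016-07-02 13:22:56"), (125, 32, "2016-07-02 13:27:07"),
], ["itemid", "eventid", "timestamp"])

# Previous timestamp within each itemid, ordered by time.
w = Window.partitionBy("itemid").orderBy("timestamp")
with_gap = df.withColumn(
    "gap_seconds",
    unix_timestamp("timestamp") - unix_timestamp(lag("timestamp").over(w)))

with_gap.filter(col("gap_seconds") > 4 * 3600).show()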

How to derive Percentile using Spark Data frame and GroupBy in python

Submitted by 寵の児 on 2019-12-10 16:32:07
Question: I have a Spark dataframe which has Date, Group and Price columns. I'm trying to derive the 0.6 percentile for the Price column of that dataframe in Python. Besides, I need to add the output as a new column. I tried the code below:

perudf = udf(lambda x: x.quantile(.6))
df1 = df.withColumn("Percentile", df.groupBy("group").agg("group"), perudf('price'))

but it throws the following error:

assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all
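
A hedged sketch of one way this is often done (not from the original post): aggregate with Spark SQL's percentile_approx per group and join the result back on as a new column, rather than calling a pandas-style quantile inside a UDF.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with the columns described in the question.
df = spark.createDataFrame([
    ("2019-01-01", "A", 10.0), ("2019-01-02", "A", 20.0),
    ("2019-01-01", "B", 5.0), ("2019-01-02", "B", 15.0),
], ["date", "group", "price"])

# Approximate 60th percentile of price per group.
pct = df.groupBy("group").agg(
    expr("percentile_approx(price, 0.6)").alias("percentile_60"))

# Attach it back to every row of the original frame as a new column.
df1 = df.join(pct, on="group", how="left")
df1.show()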

How do you add a numpy.array as a new column to a pyspark.SQL DataFrame?

Submitted by 三世轮回 on 2019-12-10 15:25:10
Question: Here is the code to create a pyspark.sql DataFrame:

import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

So that sparkdf looks like:

 a  b  c
 1  2  3
 4  5  6
 7  8  9
10 11 12

Now I would like to add a numpy array (or even a list) as a new column:

new_col = np.array([20,20,20,20])

But the standard way
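
A hedged sketch of one common workaround (not from the original question): Spark cannot attach a driver-side array directly, so both sides can be given a row index with row_number() over monotonically_increasing_id() and then joined.

import numpy as np
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

spark = SparkSession.builder.getOrCreate()

sparkdf = spark.createDataFrame(
    [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)], ["a", "b", "c"])

new_col = np.array([20, 20, 20, 20])

# Index the existing rows in their current order.
w = Window.orderBy(monotonically_increasing_id())
left = sparkdf.withColumn("row_idx", row_number().over(w))

# Build a one-column DataFrame from the array with matching indices.
right = spark.createDataFrame(
    [(i + 1, int(v)) for i, v in enumerate(new_col)], ["row_idx", "newcol"])

result = left.join(right, on="row_idx").drop("row_idx")
result.show()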

What does df.repartition with no column arguments partition on?

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-10 14:47:27
Question: In PySpark the repartition method has an optional columns argument, which will of course repartition your dataframe by that key. My question is: how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning
    expressions. The resulting DataFrame is hash partitioned.

    :param numPartitions: can be an
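
A short illustration, assuming a recent Spark version (not from the original post): repartition(n) with no columns redistributes rows round-robin across n partitions rather than hashing on a key, which can be observed by counting rows per partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
# glom() collects each partition into a list, so len() gives rows per partition.
sizes = df.repartition(8).rdd.glom().map(len).collect()
print(sizes)   # roughly equal counts, independent of any column value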

saving a list of rows to a Hive table in pyspark

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-10 13:45:41
Question: I have a pyspark app. I copied a Hive table to my HDFS directory, and in Python I run a sqlContext.sql query on this table. Now this variable is a DataFrame I call rows. I need to randomly shuffle the rows, so I had to convert them to a list of rows: rows_list = rows.collect(). Then I shuffle(rows_list), which shuffles the list in place. I take the number of random rows I need, x:

for r in range(x):
    allrows2add.append(rows_list[r])

Now I want to save allrows2add as a Hive table OR append an
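
A hedged sketch of one way to finish this (not from the original post; the table names are hypothetical): rebuild a DataFrame from the shuffled list of Rows, reusing the original schema, and append it to a Hive table.

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

rows = spark.sql("SELECT * FROM some_db.some_table")      # hypothetical source table
rows_list = rows.collect()
random.shuffle(rows_list)

x = 100                            # number of random rows to keep
allrows2add = rows_list[:x]

# Rebuild a DataFrame from the sampled Rows and append it to a Hive table.
sample_df = spark.createDataFrame(allrows2add, schema=rows.schema)
sample_df.write.mode("append").saveAsTable("some_db.sample_table")   # hypothetical target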