pyspark-sql

Find and remove matching column values in pyspark

Submitted by 痴心易碎 on 2019-12-12 19:15:31
Question: I have a pyspark dataframe where occasionally a column will have a wrong value that matches another column. It looks something like this:

| Date       | Latitude   |
| 2017-01-01 | 43.4553    |
| 2017-01-02 | 42.9399    |
| 2017-01-03 | 43.0091    |
| 2017-01-04 | 2017-01-04 |

Obviously, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin(), but I can't seem to get it to work. If I try df['Date'].isin(['Latitude']) I get: Column<(Date
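Not part of the original post: a minimal sketch of one way to drop such rows, assuming the bad Latitude values are literally equal to the Date value in the same row (both compared as strings).

from pyspark.sql import functions as F

# Keep only rows where Latitude does not simply repeat the Date value.
# Note: rows where either column is null are also dropped by this comparison.
cleaned = df.filter(F.col("Latitude").cast("string") != F.col("Date").cast("string"))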

Drop consecutive duplicates in a pyspark dataframe

Submitted by 旧街凉风 on 2019-12-12 14:23:19
Question: Given a dataframe like:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  3|2.0|
## |  3|1.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I want to remove the consecutive repetitions and obtain:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I found ways of doing this in Pandas but nothing in PySpark.

Answer 1: The answer below should work as you desire, although there might be room for some optimization: from pyspark.sql.window import
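The answer excerpt is cut off above; here is a sketch of the lag-based approach it appears to start. It assumes the dataframe is named df and has an explicit ordering column (here a hypothetical "index"), since Spark rows have no inherent order.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("index")  # "index" is an assumed ordering column
result = (df
          .withColumn("prev_id", F.lag("id").over(w))
          .filter(F.col("prev_id").isNull() | (F.col("prev_id") != F.col("id")))
          .drop("prev_id"))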

Spark request max count

Submitted by £可爱£侵袭症+ on 2019-12-12 14:00:56
Question: I'm a beginner with Spark and I'm trying to write a query that retrieves the most visited web pages. My query is the following:

mostPopularWebPageDF = logDF.groupBy("webPage").agg(functions.count("webPage").alias("cntWebPage")).agg(functions.max("cntWebPage")).show()

With this query I retrieve only a dataframe with the max count, but I want a dataframe with both this score and the web page that holds it. Something like:

webPage      max(cntWebPage)
google.com   2

How can I fix this?
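Not from the thread's answers: a sketch of one common way to keep both the page and its count, by ordering on the count and taking the top row instead of aggregating twice.

from pyspark.sql import functions

mostPopularWebPageDF = (logDF
    .groupBy("webPage")
    .agg(functions.count("webPage").alias("cntWebPage"))
    .orderBy(functions.desc("cntWebPage"))
    .limit(1))
mostPopularWebPageDF.show()
# expected to show google.com with cntWebPage = 2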

MySQL read with PySpark

Submitted by 十年热恋 on 2019-12-12 12:09:44
Question: I have the following test code:

from pyspark import SparkContext, SQLContext

sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')

if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****",
        properties={"driver": 'com.mysql.jdbc.Driver'}
    ).load()
    print(df)

When I run it, I get the following error: java.lang.ClassNotFoundException: com
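The error message is cut off, but a ClassNotFoundException on a JDBC read usually means the MySQL connector JAR is not on Spark's classpath. A sketch of one way to pull it in; the connector coordinates and driver class below are assumptions about the environment, not taken from the post (the JAR can equally be supplied via --jars or --packages on spark-submit).

from pyspark.sql import SparkSession

# Download the MySQL JDBC driver and put it on the classpath before the JVM starts.
spark = (SparkSession.builder
         .appName("mysql-read")
         .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost/mysql")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "users")
      .option("user", "user")
      .option("password", "****")
      .load())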

pyspark createdataframe: string interpreted as timestamp, schema mixes up columns

Submitted by [亡魂溺海] on 2019-12-12 11:15:35
Question: I have a really strange error with Spark dataframes which causes a string to be evaluated as a timestamp. Here is my setup code:

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

new_schema = StructType([StructField("item_id", StringType(), True),
                         StructField("date", TimestampType(), True),
                         StructField("description", StringType(), True)
                         ])

df = sqlContext.createDataFrame([Row(description='description',
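The snippet is cut off, but a frequent cause of this symptom (an assumption, not confirmed by the excerpt) is that in older PySpark versions Row(**kwargs) sorts its fields alphabetically, so they no longer line up with a schema declared in a different order. A sketch of one workaround: pass plain tuples in exact schema order instead of Row objects (the values below are hypothetical).

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

new_schema = StructType([StructField("item_id", StringType(), True),
                         StructField("date", TimestampType(), True),
                         StructField("description", StringType(), True)])

df = sqlContext.createDataFrame(
    [("item-1", datetime(2017, 1, 1), "description")],  # tuples in schema order
    schema=new_schema)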

Pyspark - Retain null values when using collect_list

Submitted by 此生再无相见时 on 2019-12-12 10:58:16
Question: According to the accepted answer in "pyspark collect_set or collect_list with groupby", when you do a collect_list on a certain column, the null values in that column are removed. I have checked and this is true. But in my case, I need to keep the null values -- how can I achieve this? I did not find any info on this kind of variant of the collect_list function. Background context to explain why I want the nulls: I have a dataframe df as below:

cId | eId | amount | city
1   | 2   | 20.0   | Paris
1   | 2   |
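Not from the accepted answer: a sketch of one common workaround. collect_list drops top-level nulls, but it keeps structs whose fields are null, so the column can be wrapped in a struct and unwrapped afterwards. The grouping keys come from the excerpt; collecting the city column is chosen here just for illustration.

from pyspark.sql import functions as F

result = (df
          .groupBy("cId", "eId")
          .agg(F.collect_list(F.struct(F.col("city").alias("city"))).alias("cities"))
          .withColumn("cities", F.col("cities.city")))  # back to an array that keeps nulls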

How to filter column on values in list in pyspark?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-12 10:43:09
Question: I have a dataframe rawdata on which I have to apply a filter condition on column X with the values CB, CI and CR. So I used the code below:

df = dfRawData.filter(col("X").between("CB", "CI", "CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.

Answer 1: between is used to check whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To
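The answer is cut off at "To"; a sketch of the list-membership filter it points toward, using isin:

from pyspark.sql.functions import col

df = dfRawData.filter(col("X").isin("CB", "CI", "CR"))
# or, equivalently, from a Python list:
values = ["CB", "CI", "CR"]
df = dfRawData.filter(col("X").isin(values))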

SyntaxError defining schema for Spark SQL dataframe

Submitted by 雨燕双飞 on 2019-12-12 04:44:53
Question: My pyspark console is telling me that I have invalid syntax on the line following my for loop. The console doesn't execute the for loop until the schema = StructType(fields) line, where it raises the SyntaxError, but the for loop looks fine to me...

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)

lines = sc.textFile('file:///home/w205/hospital_compare/surveys_responses.csv')
parts = lines.map(lambda l: l.split(','))
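Not from the post: in the interactive shell, a SyntaxError reported on the line after a for loop usually means a parenthesis or bracket opened inside the loop body was never closed. A minimal balanced sketch of that section; the field handling and names are assumptions.

from pyspark.sql.types import StructType, StructField, StringType

first = parts.first()  # assumes the first row holds the column names
fields = []
for name in first:
    fields.append(StructField(name.strip(), StringType(), True))  # note every "(" is closed
schema = StructType(fields)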

Spark RDD groupByKey + join vs join performance

Submitted by 瘦欲@ on 2019-12-12 02:24:12
Question: I am using Spark on a cluster that I share with other users, so running time alone is not a reliable way to tell which of my code variants is more efficient: while I am running the more efficient code, someone else may be running huge jobs that make my code take longer. So may I ask two questions here:

I was using the join function to join 2 RDDs, and I am trying to use groupByKey() before the join, like this:

rdd1.groupByKey().join(rdd2)

seems that it
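A short sketch contrasting the two variants in the question. The groupByKey step buffers all of a key's values together and changes the join result type (the left side becomes an iterable), and it is generally not faster than a plain join:

# Both rdd1 and rdd2 are RDDs of (key, value) pairs, as in the question.
joined_direct  = rdd1.join(rdd2)               # result values: (v1, v2)
joined_grouped = rdd1.groupByKey().join(rdd2)  # result values: (iterable_of_v1, v2),
                                               # with every key's values buffered together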

Timedelta in Pyspark Dataframes - TypeError

Submitted by 空扰寡人 on 2019-12-11 21:15:45
Question: I am working with Spark 2.3 and Python 3.6 (pyspark 2.3.1). I have a Spark DataFrame where each entry is a workstep, and I want to group some rows together into a work session. This should be done in the function getSessions below; I believe it works. I then create an RDD that contains all the information I want -- each entry is a Row object with the desired columns, and it looks like the types are fine (some data disguised):

rddSessions_flattened.take(1)
# [Row(counter=1, end=datetime.datetime
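The traceback is cut off above, so the cause here is an assumption: Spark SQL has no data type for Python timedelta objects, so a TypeError typically appears when a Row carries one (for example, the result of subtracting two datetimes). A sketch of one workaround, storing the duration as seconds instead; "end" appears in the Row shown, while "start" and the helper function are hypothetical.

from pyspark.sql import Row

def with_duration_seconds(row):
    d = row.asDict()
    d["duration_s"] = (row.end - row.start).total_seconds()  # float instead of timedelta
    return Row(**d)

rdd_with_duration = rddSessions_flattened.map(with_duration_seconds)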