pyspark-sql

PySpark sqlContext read Postgres 9.6 NullPointerException

Submitted by 孤人 on 2019-12-13 13:51:52
Question: Trying to read a table with PySpark from a Postgres DB. I have set up the following code and verified that the SparkContext exists: import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /tmp/jars/postgresql-42.0.0.jar --jars /tmp/jars/postgresql-42.0.0.jar pyspark-shell' from pyspark import SparkContext, SparkConf conf = SparkConf() conf.setMaster("local[*]") conf.setAppName('pyspark') sc = SparkContext(conf=conf) from pyspark.sql import SQLContext properties = { "driver": "org.postgresql
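
The excerpt is cut off above; as a minimal sketch of the JDBC read it builds toward, the snippet below passes the driver class through the connection properties and calls read.jdbc. The URL, table name, and credentials are placeholders, not values from the question.

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the SparkContext created in the excerpt

properties = {
    "driver": "org.postgresql.Driver",
    "user": "postgres",    # placeholder
    "password": "secret",  # placeholder
}

url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder host/database

# Read the whole table through the Postgres JDBC driver on the driver classpath
df = sqlContext.read.jdbc(url=url, table="my_table", properties=properties)
df.printSchema()
```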

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

Submitted by 跟風遠走 on 2019-12-13 04:28:28
Question: I'm running pyspark-sql code on the Hortonworks sandbox 18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3 # code from pyspark.sql import * from pyspark.sql.types import * rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv") rdd2 = rdd1.map( lambda x : x.split("," ) ) df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"]) df1.printSchema() root |-- id: string (nullable = true) |-- cat_id: string (nullable = true) |-- name: string (nullable =
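
This IllegalStateException typically appears when some rows split into more or fewer fields than the declared schema. A hedged sketch of one common fix is shown below: drop rows whose field count does not match the six-column schema before building the dataframe. The path and column names follow the excerpt; sc and sqlContext are assumed to come from the Spark 1.6 shell.

```python
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))

# Keep only rows that actually have 6 fields, so every row matches the schema
rdd_clean = rdd2.filter(lambda fields: len(fields) == 6)

df1 = sqlContext.createDataFrame(
    rdd_clean, ["id", "cat_id", "name", "desc", "price", "url"]
)
df1.printSchema()
```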

Correlated subquery column in Spark SQL is not allowed as part of a non-equality predicate

Submitted by 一曲冷凌霜 on 2019-12-13 04:26:31
Question: I am trying to write a subquery in a WHERE clause like the one below, but I am getting "Correlated column is not allowed in a non-equality predicate:" SELECT *, holidays FROM ( SELECT *, s.holidays, s.entity FROM transit_t tt WHERE ( SELECT Count(thedate) AS holidays FROM fact_ent_rt WHERE entity=tt.awborigin AND ( Substring(thedate,1,10)) BETWEEN (Substring(awbpickupdate,1,10)) AND ( Substring(deliverydate,1,10)) AND ( nholidayflag = true OR weekendflag = true))) s Are there any issues with this query? because i
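
The excerpt is cut off, but as a hedged sketch, one common workaround for this restriction is to rewrite the correlated count as an explicit LEFT JOIN plus aggregation, keeping the BETWEEN conditions in the join predicate. Table and column names follow the query above; the grouping shown is an assumption, and `spark` is assumed to be an existing SparkSession.

```python
# Count qualifying holiday/weekend dates per awborigin via a join instead of
# a correlated subquery with non-equality conditions.
holidays_per_origin = spark.sql("""
    SELECT tt.awborigin,
           COUNT(f.thedate) AS holidays
    FROM transit_t tt
    LEFT JOIN fact_ent_rt f
      ON f.entity = tt.awborigin
     AND SUBSTRING(f.thedate, 1, 10)
         BETWEEN SUBSTRING(tt.awbpickupdate, 1, 10)
             AND SUBSTRING(tt.deliverydate, 1, 10)
     AND (f.nholidayflag = true OR f.weekendflag = true)
    GROUP BY tt.awborigin
""")
holidays_per_origin.show()
```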

Selecting or removing duplicate columns from a Spark dataframe

Submitted by 喜你入骨 on 2019-12-13 04:05:51
Question: Given a Spark dataframe with duplicate column names (e.g. A) whose upstream or source I cannot modify, how do I select, remove or rename one of the columns so that I can retrieve its values? df.select('A') gives me an ambiguous-column error, as do filter, drop, and withColumnRenamed. How do I select one of the columns? Answer 1: The only way I found after hours of research is to rename the column set, then create another dataframe with the new set as the header. E.g., if you
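
A minimal sketch of the renaming approach described in the answer: rebuild the dataframe with unique column names via toDF (which renames positionally), then select the one you want. The column list here is illustrative; it assumes a three-column dataframe where the first two are both named "A".

```python
# Assumed shape: columns ["A", "A", "B"]; give them unique names by position
cols = ["A_1", "A_2", "B"]
df_renamed = df.toDF(*cols)

# Now either duplicate can be addressed unambiguously
df_renamed.select("A_1").show()
```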

Apache Spark group by combining types and sub types

Submitted by 孤街醉人 on 2019-12-13 04:04:48
Question: I have this dataset in Spark: val sales = Seq( ("Warsaw", 2016, "facebook","share",100), ("Warsaw", 2017, "facebook","like",200), ("Boston", 2015,"twitter","share",50), ("Boston", 2016,"facebook","share",150), ("Toronto", 2017,"twitter","like",50) ).toDF("city", "year","media","action","amount") I can now group this by city and media like this: val groupByCityAndYear = sales .groupBy("city", "media") .count() groupByCityAndYear.show() +-------+--------+-----+ | city| media|count| +-------+---
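
The question's snippet is in Scala; for reference, a PySpark sketch of the same dataset and grouping is below (data and column names are taken from the excerpt, and `spark` is assumed to be an existing SparkSession).

```python
# Recreate the sales dataset and the groupBy from the excerpt in PySpark
sales = spark.createDataFrame(
    [
        ("Warsaw", 2016, "facebook", "share", 100),
        ("Warsaw", 2017, "facebook", "like", 200),
        ("Boston", 2015, "twitter", "share", 50),
        ("Boston", 2016, "facebook", "share", 150),
        ("Toronto", 2017, "twitter", "like", 50),
    ],
    ["city", "year", "media", "action", "amount"],
)

sales.groupBy("city", "media").count().show()
```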

pyspark's window function fn.avg() only outputs the same data

Submitted by 孤街醉人 on 2019-12-13 03:38:38
Question: Here is my code: import pandas as pd from pyspark.sql import SQLContext import pyspark.sql.functions as fn from pyspark.sql.functions import isnan, isnull from pyspark.sql.functions import lit from pyspark.sql.window import Window spark= SparkSession.builder.appName(" ").getOrCreate() file = "D:\project\HistoryData.csv" lines = pd.read_csv(file) spark_df=spark.createDataFrame(cc,['id','time','average','max','min']) temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1) df = spark
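
The excerpt breaks off before the window is applied. As a hedged sketch of what the poster appears to intend, the snippet below computes a three-row moving average with the same partition, order, and frame as the excerpt; the inline sample data replaces the CSV and is purely illustrative.

```python
from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-avg").getOrCreate()

# Illustrative stand-in for the CSV-backed dataframe in the question
df = spark.createDataFrame(
    [(1, "09:00", 1.0), (2, "09:00", 2.0), (3, "09:00", 4.0)],
    ["id", "time", "average"],
)

# Same window definition as the excerpt: per time value, ordered by id,
# covering the previous, current, and next row
temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1)

df.withColumn("moving_avg", fn.avg("average").over(temp)).show()
```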

How to use array type column value in CASE statement

Submitted by 耗尽温柔 on 2019-12-13 03:38:08
Question: I have a dataframe with two columns, listA stored as Seq[String] and valB stored as String. I want to create a third column, valC, which will be of Int type; its value is 1 if valB is present in listA, otherwise 0. I tried the following: val dfWithAdditionalColumn = df.withColumn("valC", when($"listA".contains($"valB"), 1).otherwise(0)) But Spark failed to execute this and gave the following error: cannot resolve 'contains('listA', 'valB')' due to data type mismatch: argument 1
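
The question's code is Scala; a commonly suggested workaround, sketched here in PySpark, is to evaluate array_contains through a SQL expression so that the needle can be another column, then map the boolean to 1/0. Column names follow the question; treat this as a sketch, not the accepted answer.

```python
from pyspark.sql import functions as F

# valC = 1 when valB appears in the array column listA, else 0
df_with_valc = df.withColumn(
    "valC",
    F.when(F.expr("array_contains(listA, valB)"), 1).otherwise(0),
)
df_with_valc.show()
```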

Combine multiple rows into a single row

Submitted by 别等时光非礼了梦想. on 2019-12-13 03:29:43
Question: I am trying to achieve this by building SQL in PySpark. The goal is to combine multiple rows into a single row. Example: I want to convert this

+-----+----+----+-----+
| col1|col2|col3| col4|
+-----+----+----+-----+
|x    | y  | z  |13::1|
|x    | y  | z  |10::2|
+-----+----+----+-----+

to

+-----+----+----+-----------+
| col1|col2|col3|       col4|
+-----+----+----+-----------+
|x    | y  | z  |13::1;10::2|
+-----+----+----+-----------+

Answer 1: What you're looking for is the spark-sql version of this answer, which is the
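
Since the answer is cut off here, a minimal PySpark sketch of the usual approach: group on the key columns and concatenate the col4 values with ";" using collect_list and concat_ws. Column names follow the example; note that the order of the concatenated values is not guaranteed without an explicit sort.

```python
from pyspark.sql import functions as F

result = (
    df.groupBy("col1", "col2", "col3")
      .agg(F.concat_ws(";", F.collect_list("col4")).alias("col4"))
)
result.show(truncate=False)
```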

pyspark one to many join operation

Submitted by 为君一笑 on 2019-12-13 03:18:18
Question: In PySpark, say there are two dataframes, dfA (name, class) and dfB (class, time). If dfA.select('class').distinct().count() = n, how should I optimize the join between them for the two cases n < 100 and n > 100000? Source: https://stackoverflow.com/questions/58026274/pyspark-one-to-many-join-operation
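
As a hedged sketch of the usual trade-off: when the join key has low cardinality (and the keyed side is small), hinting a broadcast join avoids a shuffle; when cardinality is very high, the default shuffle (sort-merge) join is typically appropriate, optionally repartitioning both sides on the key. dfA and dfB follow the question; whether dfB is actually small enough to broadcast is an assumption.

```python
from pyspark.sql import functions as F

# Case n < 100: broadcast the smaller, per-class side to skip the shuffle
joined_small = dfA.join(F.broadcast(dfB), on="class")

# Case n > 100000: let Spark shuffle; pre-partitioning both sides on the
# join key can reduce skew and repeated shuffles in later stages
joined_large = dfA.repartition("class").join(dfB.repartition("class"), on="class")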

Getting latest dates from each year in a PySpark date column

Submitted by 独自空忆成欢 on 2019-12-12 20:24:02
Question: I have a table like this:

+----------+-------------+
|      date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-01| 4.21496454E7|
|2018-01-04| 1.21496454E7|
|2018-01-07| 4.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-01| 1.21496454E7|
|2019-01-04| 2.21496454E7|
|2019-01-07| 3.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-01| 5.21496454E7|
|2020-01-04| 4.21496454E7|
|2020-01-07| 6.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-01| 2.21496454E7|
|2021-01-04| 1
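
The excerpt is truncated, but as a sketch of one common way to get the latest date per year: rank rows within each year by date descending with a window and keep the first. Column names follow the table above; the dataframe is assumed to be called df, and the string dates are assumed to be in yyyy-MM-dd form so year() can interpret them.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One partition per calendar year, newest date first
w = Window.partitionBy(F.year("date")).orderBy(F.col("date").desc())

latest_per_year = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)   # keep only the latest row of each year
      .drop("rn")
)
latest_per_year.show()
```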