pyspark-sql

PySpark sqlContext read Postgres 9.6 NullPointerException

Submitted by 孤人 on 2019-12-13 13:51:52
Question: Trying to read a table with PySpark from a Postgres DB. I have set up the following code and verified that the SparkContext exists: import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /tmp/jars/postgresql-42.0.0.jar --jars /tmp/jars/postgresql-42.0.0.jar pyspark-shell' from pyspark import SparkContext, SparkConf conf = SparkConf() conf.setMaster("local[*]") conf.setAppName('pyspark') sc = SparkContext(conf=conf) from pyspark.sql import SQLContext properties = { "driver": "org.postgresql
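
The excerpt is cut off above; as a minimal sketch of the JDBC read it builds toward, the snippet below passes the driver class through the connection properties and calls read.jdbc. The URL, table name, and credentials are placeholders, not values from the question.

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the SparkContext created in the excerpt

properties = {
    "driver": "org.postgresql.Driver",
    "user": "postgres",    # placeholder
    "password": "secret",  # placeholder
}

url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder host/database

# Read the whole table through the Postgres JDBC driver on the driver classpath
df = sqlContext.read.jdbc(url=url, table="my_table", properties=properties)
df.printSchema()
```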

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

Submitted by 跟風遠走 on 2019-12-13 04:28:28
Question: I'm running pyspark-sql code on the Hortonworks sandbox 18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3 # code from pyspark.sql import * from pyspark.sql.types import * rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv") rdd2 = rdd1.map( lambda x : x.split("," ) ) df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"]) df1.printSchema() root |-- id: string (nullable = true) |-- cat_id: string (nullable = true) |-- name: string (nullable =
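
This IllegalStateException typically appears when some rows split into more or fewer fields than the declared schema. A hedged sketch of one common fix is shown below: drop rows whose field count does not match the six-column schema before building the dataframe. The path and column names follow the excerpt; sc and sqlContext are assumed to come from the Spark 1.6 shell.

```python
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))

# Keep only rows that actually have 6 fields, so every row matches the schema
rdd_clean = rdd2.filter(lambda fields: len(fields) == 6)

df1 = sqlContext.createDataFrame(
    rdd_clean, ["id", "cat_id", "name", "desc", "price", "url"]
)
df1.printSchema()
```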

Correlated subquery column in Spark SQL is not allowed as part of a non-equality predicate

Submitted by 一曲冷凌霜 on 2019-12-13 04:26:31
Question: I am trying to write a subquery in a WHERE clause like the one below, but I am getting "Correlated column is not allowed in a non-equality predicate:" SELECT *, holidays FROM ( SELECT *, s.holidays, s.entity FROM transit_t tt WHERE ( SELECT Count(thedate) AS holidays FROM fact_ent_rt WHERE entity=tt.awborigin AND ( Substring(thedate,1,10)) BETWEEN (Substring(awbpickupdate,1,10)) AND ( Substring(deliverydate,1,10)) AND ( nholidayflag = true OR weekendflag = true))) s Are there any issues with this query? because i
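
The excerpt is cut off, but as a hedged sketch, one common workaround for this restriction is to rewrite the correlated count as an explicit LEFT JOIN plus aggregation, keeping the BETWEEN conditions in the join predicate. Table and column names follow the query above; the grouping shown is an assumption, and `spark` is assumed to be an existing SparkSession.

```python
# Count qualifying holiday/weekend dates per awborigin via a join instead of
# a correlated subquery with non-equality conditions.
holidays_per_origin = spark.sql("""
    SELECT tt.awborigin,
           COUNT(f.thedate) AS holidays
    FROM transit_t tt
    LEFT JOIN fact_ent_rt f
      ON f.entity = tt.awborigin
     AND SUBSTRING(f.thedate, 1, 10)
         BETWEEN SUBSTRING(tt.awbpickupdate, 1, 10)
             AND SUBSTRING(tt.deliverydate, 1, 10)
     AND (f.nholidayflag = true OR f.weekendflag = true)
    GROUP BY tt.awborigin
""")
holidays_per_origin.show()
```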

Selecting or removing duplicate columns from a Spark dataframe

Submitted by 喜你入骨 on 2019-12-13 04:05:51
Question: Given a Spark dataframe with duplicate column names (e.g. A) whose upstream or source I cannot modify, how do I select, remove or rename one of the columns so that I can retrieve its values? df.select('A') gives me an ambiguous-column error, as do filter, drop, and withColumnRenamed. How do I select one of the columns? Answer 1: The only way I found after hours of research is to rename the column set, then create another dataframe with the new set as the header. E.g., if you
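
A minimal sketch of the renaming approach described in the answer: rebuild the dataframe with unique column names via toDF (which renames positionally), then select the one you want. The column list here is illustrative; it assumes a three-column dataframe where the first two are both named "A".

```python
# Assumed shape: columns ["A", "A", "B"]; give them unique names by position
cols = ["A_1", "A_2", "B"]
df_renamed = df.toDF(*cols)

# Now either duplicate can be addressed unambiguously
df_renamed.select("A_1").show()
```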

Apache Spark group by combining types and sub types

Submitted by 孤街醉人 on 2019-12-13 04:04:48
Question: I have this dataset in Spark: val sales = Seq( ("Warsaw", 2016, "facebook","share",100), ("Warsaw", 2017, "facebook","like",200), ("Boston", 2015,"twitter","share",50), ("Boston", 2016,"facebook","share",150), ("Toronto", 2017,"twitter","like",50) ).toDF("city", "year","media","action","amount") I can now group this by city and media like this: val groupByCityAndYear = sales .groupBy("city", "media") .count() groupByCityAndYear.show() +-------+--------+-----+ | city| media|count| +-------+---
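
The question's snippet is in Scala; for reference, a PySpark sketch of the same dataset and grouping is below (data and column names are taken from the excerpt, and `spark` is assumed to be an existing SparkSession).

```python
# Recreate the sales dataset and the groupBy from the excerpt in PySpark
sales = spark.createDataFrame(
    [
        ("Warsaw", 2016, "facebook", "share", 100),
        ("Warsaw", 2017, "facebook", "like", 200),
        ("Boston", 2015, "twitter", "share", 50),
        ("Boston", 2016, "facebook", "share", 150),
        ("Toronto", 2017, "twitter", "like", 50),
    ],
    ["city", "year", "media", "action", "amount"],
)

sales.groupBy("city", "media").count().show()
```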

pyspark's window function fn.avg() only outputs the same data

Submitted by 孤街醉人 on 2019-12-13 03:38:38
Question: Here is my code: import pandas as pd from pyspark.sql import SQLContext import pyspark.sql.functions as fn from pyspark.sql.functions import isnan, isnull from pyspark.sql.functions import lit from pyspark.sql.window import Window spark= SparkSession.builder.appName(" ").getOrCreate() file = "D:\project\HistoryData.csv" lines = pd.read_csv(file) spark_df=spark.createDataFrame(cc,['id','time','average','max','min']) temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1) df = spark
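
The excerpt breaks off before the window is applied. As a hedged sketch of what the poster appears to intend, the snippet below computes a three-row moving average with the same partition, order, and frame as the excerpt; the inline sample data replaces the CSV and is purely illustrative.

```python
from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-avg").getOrCreate()

# Illustrative stand-in for the CSV-backed dataframe in the question
df = spark.createDataFrame(
    [(1, "09:00", 1.0), (2, "09:00", 2.0), (3, "09:00", 4.0)],
    ["id", "time", "average"],
)

# Same window definition as the excerpt: per time value, ordered by id,
# covering the previous, current, and next row
temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1)

df.withColumn("moving_avg", fn.avg("average").over(temp)).show()
```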

How to use array type column value in CASE statement

Submitted by 耗尽温柔 on 2019-12-13 03:38:08
Question: I have a dataframe with two columns, listA stored as Seq[String] and valB stored as String. I want to create a third column, valC, which will be of Int type; its value is 1 if valB is present in listA, otherwise 0. I tried the following: val dfWithAdditionalColumn = df.withColumn("valC", when($"listA".contains($"valB"), 1).otherwise(0)) But Spark failed to execute this and gave the following error: cannot resolve 'contains('listA', 'valB')' due to data type mismatch: argument 1
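
The question's code is Scala; a commonly suggested workaround, sketched here in PySpark, is to evaluate array_contains through a SQL expression so that the needle can be another column, then map the boolean to 1/0. Column names follow the question; treat this as a sketch, not the accepted answer.

```python
from pyspark.sql import functions as F

# valC = 1 when valB appears in the array column listA, else 0
df_with_valc = df.withColumn(
    "valC",
    F.when(F.expr("array_contains(listA, valB)"), 1).otherwise(0),
)
df_with_valc.show()
```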

Combine multiple rows into a single row

Submitted by 别等时光非礼了梦想. on 2019-12-13 03:29:43
Question: I am trying to achieve this by building SQL in PySpark. The goal is to combine multiple rows into a single row. Example: I want to convert this

+-----+----+----+-----+
| col1|col2|col3| col4|
+-----+----+----+-----+
|x    | y  | z  |13::1|
|x    | y  | z  |10::2|
+-----+----+----+-----+

to

+-----+----+----+-----------+
| col1|col2|col3|       col4|
+-----+----+----+-----------+
|x    | y  | z  |13::1;10::2|
+-----+----+----+-----------+

Answer 1: What you're looking for is the spark-sql version of this answer, which is the
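
Since the answer is cut off here, a minimal PySpark sketch of the usual approach: group on the key columns and concatenate the col4 values with ";" using collect_list and concat_ws. Column names follow the example; note that the order of the concatenated values is not guaranteed without an explicit sort.

```python
from pyspark.sql import functions as F

result = (
    df.groupBy("col1", "col2", "col3")
      .agg(F.concat_ws(";", F.collect_list("col4")).alias("col4"))
)
result.show(truncate=False)
```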

pyspark one to many join operation

Submitted by 为君一笑 on 2019-12-13 03:18:18
Question: In PySpark, say there are two dataframes, dfA (name, class) and dfB (class, time). If dfA.select('class').distinct().count() = n, how should I optimize the join between them for the two cases n < 100 and n > 100000? Source: https://stackoverflow.com/questions/58026274/pyspark-one-to-many-join-operation
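
As a hedged sketch of the usual trade-off: when the join key has low cardinality (and the keyed side is small), hinting a broadcast join avoids a shuffle; when cardinality is very high, the default shuffle (sort-merge) join is typically appropriate, optionally repartitioning both sides on the key. dfA and dfB follow the question; whether dfB is actually small enough to broadcast is an assumption.

```python
from pyspark.sql import functions as F

# Case n < 100: broadcast the smaller, per-class side to skip the shuffle
joined_small = dfA.join(F.broadcast(dfB), on="class")

# Case n > 100000: let Spark shuffle; pre-partitioning both sides on the
# join key can reduce skew and repeated shuffles in later stages
joined_large = dfA.repartition("class").join(dfB.repartition("class"), on="class")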

Getting latest dates from each year in a PySpark date column

Submitted by 独自空忆成欢 on 2019-12-12 20:24:02
Question: I have a table like this:

+----------+-------------+
|      date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-01| 4.21496454E7|
|2018-01-04| 1.21496454E7|
|2018-01-07| 4.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-01| 1.21496454E7|
|2019-01-04| 2.21496454E7|
|2019-01-07| 3.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-01| 5.21496454E7|
|2020-01-04| 4.21496454E7|
|2020-01-07| 6.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-01| 2.21496454E7|
|2021-01-04| 1
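
The excerpt is truncated, but as a sketch of one common way to get the latest date per year: rank rows within each year by date descending with a window and keep the first. Column names follow the table above; the dataframe is assumed to be called df, and the string dates are assumed to be in yyyy-MM-dd form so year() can interpret them.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One partition per calendar year, newest date first
w = Window.partitionBy(F.year("date")).orderBy(F.col("date").desc())

latest_per_year = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)   # keep only the latest row of each year
      .drop("rn")
)
latest_per_year.show()
```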