pyspark-sql

PySpark insert overwrite issue

Submitted by 邮差的信 on 2019-12-11 20:08:12

Question: Below are the last two lines of the PySpark ETL code:

    df_writer = DataFrameWriter(usage_fact)
    df_writer.partitionBy("data_date", "data_product").saveAsTable(
        usageWideFactTable, format=fileFormat, mode=writeMode, path=usageWideFactpath)

where writeMode = append and fileFormat = orc. I wanted to use insert overwrite in place of this so that my data does not get appended when I re-run the code. Hence I have used this:

    usage_fact.createOrReplaceTempView("usage_fact")
    fact = spark.sql("insert overwrite …
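
The question is cut off above. For readers hitting the same append problem, here is a minimal sketch of one way to overwrite only the partitions being rewritten; it assumes Spark 2.3+ and that the target table already exists with the same column order as usage_fact, which is not confirmed by the truncated post.

    # Sketch, not the poster's final solution: dynamic partition overwrite keeps
    # untouched partitions and replaces only those present in usage_fact.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (usage_fact.write
        .mode("overwrite")
        .insertInto(usageWideFactTable))   # insertInto matches columns by position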

How to get strings separated by commas from a list to a query in PySpark?

Submitted by 三世轮回 on 2019-12-11 18:27:41

Question: I want to generate a query by using a list in PySpark:

    list = ["hi@gmail.com", "goodbye@gmail.com"]
    query = "SELECT * FROM table WHERE email IN (" + list + ")"

This is my desired output:

    query
    SELECT * FROM table WHERE email IN ("hi@gmail.com", "goodbye@gmail.com")

Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects. Can anyone help me achieve this? Thanks.

Answer 1: If someone's having the same issue, I found that you can use the following code: "'"+"','".join(map(str, emails …
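
The answer is truncated; a complete sketch of that join-based approach follows. The name emails comes from the truncated answer, and the variable is renamed so it no longer shadows the built-in list.

    emails = ["hi@gmail.com", "goodbye@gmail.com"]

    # Wrap each address in single quotes and join them with commas.
    in_clause = "'" + "','".join(map(str, emails)) + "'"
    query = "SELECT * FROM table WHERE email IN (" + in_clause + ")"

    print(query)
    # SELECT * FROM table WHERE email IN ('hi@gmail.com','goodbye@gmail.com')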

How to concatenate/append multiple Spark dataframes column wise in Pyspark?

Submitted by ℡╲_俬逩灬. on 2019-12-11 16:33:13

Question: How do I do the pandas equivalent of pd.concat([df1, df2], axis='columns') using PySpark dataframes? I googled and couldn't find a good solution.

    DF1
    var1
    3
    4
    5

    DF2
    var2  var3
    23    31
    44    45
    52    53

    Expected output dataframe
    var1  var2  var3
    3     23    31
    4     44    45
    5     52    53

(Edited to include expected output.)

Answer 1: Below is an example of what you want to do, but in Scala; I hope you can convert it to PySpark:

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("ParquetAppendMode")
      .getOrCreate()
    import spark …
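
Since the Scala answer is cut off, here is a hedged PySpark sketch of the same idea: give both dataframes a positional row number and join on it. Ordering by monotonically_increasing_id is only reliable for positional alignment on small, single-partition data, so treat this as an illustration rather than a general solution.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local").appName("concat_columns").getOrCreate()

    df1 = spark.createDataFrame([(3,), (4,), (5,)], ["var1"])
    df2 = spark.createDataFrame([(23, 31), (44, 45), (52, 53)], ["var2", "var3"])

    # Attach a positional row number to each dataframe, then join on it.
    w = Window.orderBy(F.monotonically_increasing_id())
    df1_idx = df1.withColumn("_row", F.row_number().over(w))
    df2_idx = df2.withColumn("_row", F.row_number().over(w))

    result = df1_idx.join(df2_idx, "_row").drop("_row")
    result.show()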

pyspark - get consistent random value across Spark sessions

Submitted by 允我心安 on 2019-12-11 16:25:19

Question: I want to add a column of random values to a dataframe (it has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions: the same random value against each row id. I am able to reproduce the results by using

    from pyspark.sql.functions import rand
    new_df = my_df.withColumn("rand_index", rand(seed=7))

but it only works when I am running it in the same Spark session. I am not getting the same results once I relaunch Spark and run my script. I also …
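
One session-independent alternative is to derive the value deterministically from the row id itself, so the same id always maps to the same number in any session. A minimal sketch, assuming the dataframe has an id column as described in the question; the "seed-7" salt string is just an illustration.

    from pyspark.sql import functions as F

    new_df = my_df.withColumn(
        "rand_index",
        # CRC32 of the salted id, scaled into [0, 1); stable across sessions.
        F.crc32(F.concat(F.col("id").cast("string"), F.lit("seed-7"))) / F.lit(float(2 ** 32))
    )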

Pyspark sql count returns different number of rows than pure sql

Submitted by 坚强是说给别人听的谎言 on 2019-12-11 16:24:13

Question: I've started using PySpark in one of my projects. I was testing different commands to explore functionalities of the library and I found something that I don't understand. Take this code:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql.dataframe import DataFrame

    sc = SparkContext(sc)
    hc = HiveContext(sc)
    hc.sql("use test_schema")
    hc.table("diamonds").count()

The last count() operation returns 53941 records. If I run instead a select count(*) from diamonds …
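
The question is cut off before the second count is shown. A small sketch for comparing both counts in the same session and ruling out stale table metadata, which is a common cause of such mismatches; the schema and table names are taken from the question.

    hc.sql("use test_schema")

    df_count  = hc.table("diamonds").count()
    sql_count = hc.sql("select count(*) as n from diamonds").collect()[0]["n"]
    print(df_count, sql_count)

    # If the two numbers disagree, refreshing cached metadata for the table is worth trying.
    hc.refreshTable("diamonds")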

Spark sql query causing huge data shuffle read / write

Submitted by 半腔热情 on 2019-12-11 16:06:23

Question: I am using Spark SQL for processing the data. Here is the query:

    select /*+ BROADCAST (C) */
        A.party_id,
        IF(B.master_id is NOT NULL, B.master_id, 'MISSING_LINK') as master_id,
        B.is_matched,
        D.partner_name,
        A.partner_id,
        A.event_time_utc,
        A.funnel_stage_type,
        A.product_id_set,
        A.ip_address,
        A.session_id,
        A.tdm_retailer_id,
        C.product_name,
        CASE WHEN C.product_category_lvl_01 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_01 END as product_category_lvl_01,
        CASE WHEN C.product_category_lvl_02 …
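
The query is truncated above. Independent of its tail, two checks that often help when a broadcast-hinted query still shuffles heavily are to confirm in the plan that C is actually broadcast and, if it is not, to raise the broadcast threshold. A sketch, with query_text standing in as a placeholder for the full SQL shown above:

    # Raise the size limit under which Spark will broadcast a relation (here 100 MB).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

    df = spark.sql(query_text)   # query_text holds the SQL above (placeholder name)
    df.explain(True)             # look for BroadcastHashJoin on C instead of SortMergeJoin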

How to find average of an array column based on index in pyspark

Submitted by 浪尽此生 on 2019-12-11 15:57:41

Question: I have data as shown below:

    place  | key  | weights
    -------+------+---------------
    amazon | lion | [34, 23, 56]
    north  | bear | [90, 45]
    amazon | lion | [38, 30, 50]
    amazon | bear | [45]
    amazon | bear | [40]

I am trying to get a result like below:

    place  | key   | average
    -------+-------+--------------------
    amazon | lion1 | 36.0   # (34 + 38) / 2
    amazon | lion2 | 26.5   # (23 + 30) / 2
    amazon | lion3 | 53.0   # (50 + 56) / 2
    north  | bear1 | 90     # (90) / 1
    north  | …
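
A sketch of one way to get this with posexplode: explode each array together with its index, average per (place, key, index), and append the 1-based index to the key. This is an illustration built from the sample data above, not the thread's accepted answer.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("amazon", "lion", [34, 23, 56]),
         ("north",  "bear", [90, 45]),
         ("amazon", "lion", [38, 30, 50]),
         ("amazon", "bear", [45]),
         ("amazon", "bear", [40])],
        ["place", "key", "weights"])

    result = (df
        # posexplode yields one row per array element, with its position.
        .select("place", "key", F.posexplode("weights").alias("pos", "weight"))
        .groupBy("place", "key", "pos")
        .agg(F.avg("weight").alias("average"))
        # lion + position 0 -> lion1, etc.
        .withColumn("key", F.concat("key", (F.col("pos") + 1).cast("string")))
        .drop("pos"))
    result.show()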

Write to Postgres from Databricks using Python [duplicate]

Submitted by 邮差的信 on 2019-12-11 15:48:00

Question: This question already has answers here: How to use JDBC source to write and read data in (Py)Spark? (3 answers). Closed last year.

I have a dataframe in Databricks called customerDetails:

    +--------------------+-----------+
    |        customerName| customerId|
    +--------------------+-----------+
    |          John Smith|       0001|
    |          Jane Burns|       0002|
    |         Frank Jones|       0003|
    +--------------------+-----------+

I would like to be able to copy this from Databricks to a table within Postgres. I found this post which used …
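
For completeness, a sketch of the standard JDBC write that the linked duplicate describes; the host, database, credentials, and target table name below are placeholders, not values from the question.

    jdbc_url = "jdbc:postgresql://<host>:5432/<database>"

    (customerDetails.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "public.customer_details")   # target table name is an assumption
        .option("user", "<user>")
        .option("password", "<password>")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save())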

Check if two pyspark Rows are equal

Submitted by 你。 on 2019-12-11 15:46:17

Question: I am writing unit tests for a Spark job, and some of the outputs are named tuples: pyspark.sql.Row. How can I assert their equality?

    actual = get_data(df)
    expected = Row(total=4, unique_ids=2)
    self.assertEqual(actual, expected)

When I do this, the values are rearranged in an order I cannot determine.

Answer 1: Your code should work as written because, according to the docs, the fields will be sorted by names. Nevertheless, another way is to use the asDict() method of pyspark.sql.Row and compare …
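
A short sketch of the asDict() comparison the answer mentions, which sidesteps any dependence on field ordering; the values are taken from the example in the question.

    from pyspark.sql import Row

    actual = Row(unique_ids=2, total=4)
    expected = Row(total=4, unique_ids=2)

    # Comparing plain dicts ignores field order entirely.
    assert actual.asDict() == expected.asDict()
    # In a unittest.TestCase: self.assertEqual(actual.asDict(), expected.asDict())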

How to update pyspark dataframe metadata on Spark 2.1?

Submitted by 主宰稳场 on 2019-12-11 14:57:40

Question: I'm facing an issue with the OneHotEncoder of Spark ML, since it reads dataframe metadata in order to determine the value range it should assign for the sparse vector object it is creating. More specifically, I'm encoding an "hour" field using a training set containing all individual values between 0 and 23. Now I'm scoring a single-row data frame using the "transform" method of the Pipeline. Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder: (24,[5],[1.0 …
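
One way to pin the value range is to attach nominal-attribute metadata to the column yourself, so OneHotEncoder sizes the vector from metadata rather than from whatever data it happens to see. The sketch below is only an illustration of that idea: the metadata= keyword on Column.alias was added in Spark 2.2, so on 2.1 (the version in the title) the same metadata can only be attached through a JVM-level workaround, and scoring_df and the exact metadata layout are assumptions, not code from the question.

    from pyspark.sql import functions as F

    # Nominal ML attribute listing all 24 hour values (assumed layout).
    hour_meta = {"ml_attr": {"type": "nominal",
                             "vals": [str(h) for h in range(24)],
                             "name": "hour"}}

    # scoring_df is a placeholder for the single-row frame being scored.
    scored_df = scoring_df.withColumn(
        "hour", F.col("hour").alias("hour", metadata=hour_meta))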