pyspark-sql

Identify Partition Key Column from a table using PySpark

Submitted by 此生再无相见时 on 2020-02-05 03:46:09
Question: I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table. It would be great if the result also included the data type of the partitioned columns. Any suggestions will be helpful. Answer 1: It can be done using desc as shown below: df = spark.sql("""desc test_dev_db.partition_date_table""") >>> df.show(truncate=False) +-----------
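
To make the truncated answer concrete, here is a minimal sketch (assuming the table name test_dev_db.partition_date_table from the answer and a Hive-enabled session) that returns the partition column names with their data types, either through the catalog API or by parsing the desc output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: the catalog API exposes an isPartition flag per column.
partition_cols = [
    (c.name, c.dataType)
    for c in spark.catalog.listColumns("partition_date_table", "test_dev_db")
    if c.isPartition
]

# Option 2: walk the DESC output; partition columns are listed again
# after the "# Partition Information" marker row.
rows = spark.sql("DESC test_dev_db.partition_date_table").collect()
seen_marker = False
desc_partition_cols = []
for r in rows:
    if r.col_name.strip() == "# Partition Information":
        seen_marker = True
        continue
    if seen_marker and r.col_name.strip() and not r.col_name.startswith("#"):
        desc_partition_cols.append((r.col_name, r.data_type))

print(partition_cols)
print(desc_partition_cols)
```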

How to overwrite the rdd saveAsPickleFile(path) if file already exist in pyspark?

Submitted by 狂风中的少年 on 2020-02-02 12:45:36
Question: How do I overwrite any existing output at the target path when saving an RDD? test1: 975078|56691|2.000|20171001_926_570_1322 975078|42993|1.690|20171001_926_570_1322 975078|46462|2.000|20171001_926_570_1322 975078|87815|1.000|20171001_926_570_1322 rdd = sc.textFile('/home/administrator/work/test1').map(lambda x: x.split("|")[:4]).map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2]))) rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1") The first time it
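
saveAsPickleFile has no overwrite mode, so a common workaround is to remove the target directory first. A minimal sketch, reusing the paths from the question and going through Spark's internal JVM gateway (sc._jsc / sc._jvm are private attributes, so treat this as an unofficial route):

```python
from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext.getOrCreate()
out_path = "/home/administrator/work/foobar_seq1"  # path from the question

# Delete the target directory (if present) via the Hadoop FileSystem API
# bound to the current Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
target = jvm.org.apache.hadoop.fs.Path(out_path)
if fs.exists(target):
    fs.delete(target, True)  # True = recursive

rdd = (sc.textFile("/home/administrator/work/test1")
         .map(lambda x: x.split("|")[:4])
         .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2]))))
rdd.coalesce(1).saveAsPickleFile(out_path)
```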

Trying to connect to Oracle from Spark

Submitted by 旧巷老猫 on 2020-02-01 07:24:06
Question: I am trying to connect Oracle to Spark and want to pull data from some tables with SQL queries, but I am not able to connect to Oracle. I have tried different workaround options, but no luck. I have followed the steps below; please correct me if I need to make any changes. I am using a Windows 7 machine and a Jupyter notebook to run PySpark. I have Python 2.7 and Spark 2.1.0. I have set a Spark classpath in the environment variables: SPARK_CLASS_PATH = C:\Oracle\Product\11.2.0\client_1\jdbc\lib
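
Rather than relying on SPARK_CLASS_PATH, a more reliable route is to hand the Oracle JDBC driver jar to Spark directly and read through the DataFrame JDBC source. A minimal sketch with a hypothetical host, service name, credentials, and driver jar name (ojdbc6.jar); set spark.jars before the session is first created:

```python
from pyspark.sql import SparkSession

# Assumed jar name/location under the client install from the question.
ojdbc_jar = r"C:\Oracle\Product\11.2.0\client_1\jdbc\lib\ojdbc6.jar"

spark = (SparkSession.builder
         .appName("oracle-read")
         .config("spark.jars", ojdbc_jar)   # ship the driver with the app
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   # hypothetical URL
      .option("dbtable", "(SELECT * FROM some_schema.some_table) t")  # table or pushed-down query
      .option("user", "scott")          # hypothetical credentials
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

df.show(5)
```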

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue services. I have a dataframe as follows: df2 = sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 55","NY"),("xxx2","81A01","TERR NAME 55","NY"),("x103","81A01","TERR NAME 01","NJ")], ["zip_code","territory_code","territory_name","state"]) # Print out information about this data df2.show() +--------+--------------+--------------+-----+ |zip_code|territory_code|territory_name|state| +--------+--------
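
For the upsert itself, one option is to collect the (small) dataframe to the driver and issue a single batched INSERT ... ON DUPLICATE KEY UPDATE instead of building the statement inside a row-by-row loop. A minimal sketch, assuming a pymysql dependency and a hypothetical MySQL endpoint and target table named territory_table:

```python
import pymysql

# Pull the rows to the driver as plain tuples (fine for small dataframes).
rows = [tuple(r) for r in df2.select("zip_code", "territory_code",
                                     "territory_name", "state").collect()]

sql = """
INSERT INTO territory_table (zip_code, territory_code, territory_name, state)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    territory_code = VALUES(territory_code),
    territory_name = VALUES(territory_name),
    state = VALUES(state)
"""

conn = pymysql.connect(host="mydb.example.com", user="admin",
                       password="secret", database="sales")  # hypothetical endpoint
try:
    with conn.cursor() as cur:
        cur.executemany(sql, rows)   # one batched round of upserts
    conn.commit()
finally:
    conn.close()
```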

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a PySpark dataframe: id | column ------------------------------ 1 | [0.2, 2, 3, 4, 3, 0.5] ------------------------------ 2 | [7, 0.3, 0.3, 8, 2,] ------------------------------ I would like to create 3 columns: Column 1: contains the sum of the elements < 2. Column 2: contains the sum of the elements > 2. Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them). If there are no such values, I put null. Expected result: id | column | column
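
A minimal sketch of one way to do this, assuming Spark 2.4+ so the filter and aggregate higher-order functions are available, with the id/column names from the question; groups with no matching element are mapped to null:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]), (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
    ["id", "column"],
)

def cond_sum(cond):
    # Keep only the matching elements, sum them, and map "no match" to null.
    return F.expr(
        f"""CASE WHEN size(filter(`column`, x -> x {cond})) = 0 THEN NULL
                 ELSE aggregate(filter(`column`, x -> x {cond}), 0D, (acc, x) -> acc + x)
            END"""
    )

result = (df
          .withColumn("column<2", cond_sum("< 2"))
          .withColumn("column>2", cond_sum("> 2"))
          .withColumn("column=2", cond_sum("= 2")))
result.show(truncate=False)
```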

How to set an existing field as a primary key in spark dataframe?

Submitted by ↘锁芯ラ on 2020-01-26 04:29:11
Question: When I write the data of a Spark dataframe into a SQL DB using the JDBC connector, it overwrites the properties of the table. So I want to set the key field in the Spark dataframe before writing the data. url = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4};encrypt=true;trustServerCertificate=false; hostNameInCertificate=*.database.windows.net;loginTimeout=30;".format(jdbcHostname, jdbcPort, jdbcDatabase, JDBCusername, JDBCpassword) newSchema_Product_names = [StructField(
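
Spark's JDBC writer cannot declare a primary key, so the usual pattern is to create the keyed table once on the SQL Server side and let Spark only load data into it. A minimal sketch, reusing the url variable from the question and a hypothetical pre-created table dbo.Product; with mode("overwrite") plus truncate=true Spark issues TRUNCATE instead of DROP/CREATE, so the existing key and table properties survive:

```python
(df.write
   .format("jdbc")
   .option("url", url)                 # url built as in the question
   .option("dbtable", "dbo.Product")   # hypothetical pre-created, keyed table
   .option("truncate", "true")         # keep the existing table definition
   .mode("overwrite")
   .save())

# Alternatively, simply append into the pre-created keyed table:
# df.write.format("jdbc").option("url", url).option("dbtable", "dbo.Product").mode("append").save()
```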

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 懵懂的女人 on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in PySpark and without breeze. For example, if my PySpark dataframe looks like this: user | weight | vec "u1" | 0.1 | [2, 4, 6] "u1" | 0.5 | [4, 8, 12] "u2" | 0.5 | [20, 40, 60] where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this: user | wsum "u1" | [2.2, 4.4, 6.6] "u2" | [10, 20, 30] To do this I have
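
A minimal sketch of a pure DataFrame approach (no breeze, no UDF), using the user/weight/vec columns from the question: explode each vector with its position, weight every element, sum per (user, position), and rebuild the array in position order.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

wsum = (df
        .select("user", "weight", F.posexplode("vec").alias("pos", "val"))
        .withColumn("wval", F.col("weight") * F.col("val"))
        .groupBy("user", "pos")
        .agg(F.sum("wval").alias("wval"))
        .groupBy("user")
        # sort_array on structs orders by the first field (pos), keeping the
        # rebuilt array aligned with the original vector positions
        .agg(F.sort_array(F.collect_list(F.struct("pos", "wval"))).alias("tmp"))
        .withColumn("wsum", F.col("tmp.wval"))
        .drop("tmp"))

wsum.show(truncate=False)
```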

How to unstack dataset (using pivot)?

Submitted by 女生的网名这么多〃 on 2020-01-25 03:20:06
Question: I tried the new "pivot" function of Spark 1.6 on a larger stacked dataset. It has 5,656,458 rows, and the IndicatorCode column has 1344 different codes. The idea was to use pivot to "unstack" (in pandas terms) this data set and have a column for each IndicatorCode. schema = StructType([ \ StructField("CountryName", StringType(), True), \ StructField("CountryCode", StringType(), True), \ StructField("IndicatorName", StringType(), True), \ StructField("IndicatorCode", StringType(), True), \
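
One thing that helps at this scale is passing the explicit list of codes to pivot(), which lets Spark skip the extra job it otherwise runs to discover the pivot values. A minimal sketch, assuming the stacked dataframe is named stacked and that it also carries Year and Value columns (the schema above is truncated, so these names are assumptions):

```python
import pyspark.sql.functions as F

# Collect the distinct pivot values once and hand them to pivot() explicitly.
codes = [r.IndicatorCode
         for r in stacked.select("IndicatorCode").distinct().collect()]

unstacked = (stacked
             .groupBy("CountryName", "CountryCode", "Year")  # "Year" assumed
             .pivot("IndicatorCode", codes)                  # explicit value list
             .agg(F.first("Value")))                         # "Value" assumed

unstacked.show(5)
```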