pyspark-sql

Identify Partition Key Column from a table using PySpark

Submitted by 此生再无相见时 on 2020-02-05 03:46:09
Question: I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table. It would be great if the result also included the data type of the partitioned columns. Any suggestions will be helpful. Answer 1: It can be done using desc as shown below: df = spark.sql("""desc test_dev_db.partition_date_table""") >>> df.show(truncate=False) +-----------
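
To make the truncated answer concrete, here is a minimal sketch (assuming the table name test_dev_db.partition_date_table from the answer and a Hive-enabled session) that returns the partition column names with their data types, either through the catalog API or by parsing the desc output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: the catalog API exposes an isPartition flag per column.
partition_cols = [
    (c.name, c.dataType)
    for c in spark.catalog.listColumns("partition_date_table", "test_dev_db")
    if c.isPartition
]

# Option 2: walk the DESC output; partition columns are listed again
# after the "# Partition Information" marker row.
rows = spark.sql("DESC test_dev_db.partition_date_table").collect()
seen_marker = False
desc_partition_cols = []
for r in rows:
    if r.col_name.strip() == "# Partition Information":
        seen_marker = True
        continue
    if seen_marker and r.col_name.strip() and not r.col_name.startswith("#"):
        desc_partition_cols.append((r.col_name, r.data_type))

print(partition_cols)
print(desc_partition_cols)
```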

How to overwrite the rdd saveAsPickleFile(path) if file already exist in pyspark?

Submitted by 狂风中的少年 on 2020-02-02 12:45:36
Question: How do I overwrite any existing output at the target path when saving an RDD? test1: 975078|56691|2.000|20171001_926_570_1322 975078|42993|1.690|20171001_926_570_1322 975078|46462|2.000|20171001_926_570_1322 975078|87815|1.000|20171001_926_570_1322 rdd = sc.textFile('/home/administrator/work/test1').map(lambda x: x.split("|")[:4]).map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2]))) rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1") The first time it
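
saveAsPickleFile has no overwrite mode, so a common workaround is to remove the target directory first. A minimal sketch, reusing the paths from the question and going through Spark's internal JVM gateway (sc._jsc / sc._jvm are private attributes, so treat this as an unofficial route):

```python
from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext.getOrCreate()
out_path = "/home/administrator/work/foobar_seq1"  # path from the question

# Delete the target directory (if present) via the Hadoop FileSystem API
# bound to the current Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
target = jvm.org.apache.hadoop.fs.Path(out_path)
if fs.exists(target):
    fs.delete(target, True)  # True = recursive

rdd = (sc.textFile("/home/administrator/work/test1")
         .map(lambda x: x.split("|")[:4])
         .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2]))))
rdd.coalesce(1).saveAsPickleFile(out_path)
```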

Trying to connect to Oracle from Spark

Submitted by 旧巷老猫 on 2020-02-01 07:24:06
Question: I am trying to connect Oracle to Spark and want to pull data from some tables with SQL queries, but I am not able to connect to Oracle. I have tried different workaround options, but no luck. I have followed the steps below; please correct me if I need to make any changes. I am using a Windows 7 machine and a Jupyter notebook to run PySpark. I have Python 2.7 and Spark 2.1.0. I have set a Spark classpath in the environment variables: SPARK_CLASS_PATH = C:\Oracle\Product\11.2.0\client_1\jdbc\lib
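
Rather than relying on SPARK_CLASS_PATH, a more reliable route is to hand the Oracle JDBC driver jar to Spark directly and read through the DataFrame JDBC source. A minimal sketch with a hypothetical host, service name, credentials, and driver jar name (ojdbc6.jar); set spark.jars before the session is first created:

```python
from pyspark.sql import SparkSession

# Assumed jar name/location under the client install from the question.
ojdbc_jar = r"C:\Oracle\Product\11.2.0\client_1\jdbc\lib\ojdbc6.jar"

spark = (SparkSession.builder
         .appName("oracle-read")
         .config("spark.jars", ojdbc_jar)   # ship the driver with the app
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   # hypothetical URL
      .option("dbtable", "(SELECT * FROM some_schema.some_table) t")  # table or pushed-down query
      .option("user", "scott")          # hypothetical credentials
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

df.show(5)
```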

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue services. I have a dataframe as follows: df2 = sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 55","NY"),("xxx2","81A01","TERR NAME 55","NY"),("x103","81A01","TERR NAME 01","NJ")], ["zip_code","territory_code","territory_name","state"]) # Print out information about this data df2.show() +--------+--------------+--------------+-----+ |zip_code|territory_code|territory_name|state| +--------+--------
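
For the upsert itself, one option is to collect the (small) dataframe to the driver and issue a single batched INSERT ... ON DUPLICATE KEY UPDATE instead of building the statement inside a row-by-row loop. A minimal sketch, assuming a pymysql dependency and a hypothetical MySQL endpoint and target table named territory_table:

```python
import pymysql

# Pull the rows to the driver as plain tuples (fine for small dataframes).
rows = [tuple(r) for r in df2.select("zip_code", "territory_code",
                                     "territory_name", "state").collect()]

sql = """
INSERT INTO territory_table (zip_code, territory_code, territory_name, state)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    territory_code = VALUES(territory_code),
    territory_name = VALUES(territory_name),
    state = VALUES(state)
"""

conn = pymysql.connect(host="mydb.example.com", user="admin",
                       password="secret", database="sales")  # hypothetical endpoint
try:
    with conn.cursor() as cur:
        cur.executemany(sql, rows)   # one batched round of upserts
    conn.commit()
finally:
    conn.close()
```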

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a PySpark dataframe: id | column ------------------------------ 1 | [0.2, 2, 3, 4, 3, 0.5] ------------------------------ 2 | [7, 0.3, 0.3, 8, 2,] ------------------------------ I would like to create 3 columns: Column 1: contains the sum of the elements < 2. Column 2: contains the sum of the elements > 2. Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them). If there are no such values, I put null. Expected result: id | column | column
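
A minimal sketch of one way to do this, assuming Spark 2.4+ so the filter and aggregate higher-order functions are available, with the id/column names from the question; groups with no matching element are mapped to null:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]), (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
    ["id", "column"],
)

def cond_sum(cond):
    # Keep only the matching elements, sum them, and map "no match" to null.
    return F.expr(
        f"""CASE WHEN size(filter(`column`, x -> x {cond})) = 0 THEN NULL
                 ELSE aggregate(filter(`column`, x -> x {cond}), 0D, (acc, x) -> acc + x)
            END"""
    )

result = (df
          .withColumn("column<2", cond_sum("< 2"))
          .withColumn("column>2", cond_sum("> 2"))
          .withColumn("column=2", cond_sum("= 2")))
result.show(truncate=False)
```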

How to set an existing field as a primary key in spark dataframe?

Submitted by ↘锁芯ラ on 2020-01-26 04:29:11
Question: When I write the data of a Spark dataframe into a SQL DB using the JDBC connector, it overwrites the properties of the table. So I want to set the key field in the Spark dataframe before writing the data. url = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4};encrypt=true;trustServerCertificate=false; hostNameInCertificate=*.database.windows.net;loginTimeout=30;".format(jdbcHostname, jdbcPort, jdbcDatabase, JDBCusername, JDBCpassword) newSchema_Product_names = [StructField(
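
Spark's JDBC writer cannot declare a primary key, so the usual pattern is to create the keyed table once on the SQL Server side and let Spark only load data into it. A minimal sketch, reusing the url variable from the question and a hypothetical pre-created table dbo.Product; with mode("overwrite") plus truncate=true Spark issues TRUNCATE instead of DROP/CREATE, so the existing key and table properties survive:

```python
(df.write
   .format("jdbc")
   .option("url", url)                 # url built as in the question
   .option("dbtable", "dbo.Product")   # hypothetical pre-created, keyed table
   .option("truncate", "true")         # keep the existing table definition
   .mode("overwrite")
   .save())

# Alternatively, simply append into the pre-created keyed table:
# df.write.format("jdbc").option("url", url).option("dbtable", "dbo.Product").mode("append").save()
```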

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 懵懂的女人 on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in PySpark and without breeze. For example, if my PySpark dataframe looks like this: user | weight | vec "u1" | 0.1 | [2, 4, 6] "u1" | 0.5 | [4, 8, 12] "u2" | 0.5 | [20, 40, 60] where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this: user | wsum "u1" | [2.2, 4.4, 6.6] "u2" | [10, 20, 30] To do this I have
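
A minimal sketch of a pure DataFrame approach (no breeze, no UDF), using the user/weight/vec columns from the question: explode each vector with its position, weight every element, sum per (user, position), and rebuild the array in position order.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

wsum = (df
        .select("user", "weight", F.posexplode("vec").alias("pos", "val"))
        .withColumn("wval", F.col("weight") * F.col("val"))
        .groupBy("user", "pos")
        .agg(F.sum("wval").alias("wval"))
        .groupBy("user")
        # sort_array on structs orders by the first field (pos), keeping the
        # rebuilt array aligned with the original vector positions
        .agg(F.sort_array(F.collect_list(F.struct("pos", "wval"))).alias("tmp"))
        .withColumn("wsum", F.col("tmp.wval"))
        .drop("tmp"))

wsum.show(truncate=False)
```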

How to unstack dataset (using pivot)?

Submitted by 女生的网名这么多〃 on 2020-01-25 03:20:06
Question: I tried the new "pivot" function of Spark 1.6 on a larger stacked dataset. It has 5,656,458 rows, and the IndicatorCode column has 1344 different codes. The idea was to use pivot to "unstack" (in pandas terms) this data set and have a column for each IndicatorCode. schema = StructType([ \ StructField("CountryName", StringType(), True), \ StructField("CountryCode", StringType(), True), \ StructField("IndicatorName", StringType(), True), \ StructField("IndicatorCode", StringType(), True), \
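
One thing that helps at this scale is passing the explicit list of codes to pivot(), which lets Spark skip the extra job it otherwise runs to discover the pivot values. A minimal sketch, assuming the stacked dataframe is named stacked and that it also carries Year and Value columns (the schema above is truncated, so these names are assumptions):

```python
import pyspark.sql.functions as F

# Collect the distinct pivot values once and hand them to pivot() explicitly.
codes = [r.IndicatorCode
         for r in stacked.select("IndicatorCode").distinct().collect()]

unstacked = (stacked
             .groupBy("CountryName", "CountryCode", "Year")  # "Year" assumed
             .pivot("IndicatorCode", codes)                  # explicit value list
             .agg(F.first("Value")))                         # "Value" assumed

unstacked.show(5)
```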