apache-spark-sql

pyspark. zip arrays in a dataframe

Posted by 笑着哭i on 2021-02-10 09:32:27
Question: I have the following PySpark DataFrame:

+------+----------------+
|    id|            data|
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+

At the end, I want to have the following DataFrame:

+--------+----------------------------------+
|      id|                              data|
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

In order to do this, I first extract the data arrays as
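The excerpt cuts off before the poster's own attempt, so here is a minimal sketch of one possible approach (an assumption, not the original code): collect all rows into a single array of (id, data) structs, then transpose the nested arrays with Spark 2.4+ higher-order functions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [10, 11, 12]), (2, [20, 21, 22]), (3, [30, 31, 32])],
    ["id", "data"],
)

# Gather every row into one array of structs, sorted by id so positions line up.
agg = df.groupBy().agg(
    F.sort_array(F.collect_list(F.struct("id", "data"))).alias("rows")
)

# Transpose: element k of the result holds the k-th value from every original row.
result = agg.select(
    F.col("rows.id").alias("id"),
    F.expr(
        "transform(sequence(0, size(rows[0].data) - 1), "
        "i -> transform(rows, r -> r.data[i]))"
    ).alias("data"),
)

result.show(truncate=False)
# id   -> [1, 2, 3]
# data -> [[10, 20, 30], [11, 21, 31], [12, 22, 32]]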

How to parse and transform json string from spark data frame rows in pyspark

Posted by 为君一笑 on 2021-02-10 07:57:07
Question: How to parse and transform a JSON string from Spark dataframe rows in PySpark? I'm looking for help with how to:

parse the JSON string into a JSON struct (output 1)
transform the JSON string into columns a, b and id (output 2)

Background: I get JSON strings via an API with a large number of rows (jstr1, jstr2, ...), which are saved to a Spark df. I can read the schema for each row separately, but this is not the solution, as it is very slow because of the large number of rows. Each jstr has the same schema; the columns/keys a
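The excerpt is cut off, but since every jstr shares one schema, a minimal sketch (with hypothetical sample strings, not the poster's data) is to infer the schema once from a single row and apply it to the whole column with from_json:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for jstr1, jstr2, ... from the question.
df = spark.createDataFrame(
    [('{"id": 1, "a": "x", "b": "y"}',), ('{"id": 2, "a": "p", "b": "q"}',)],
    ["json_str"],
)

# Infer the schema once, from a single sample row, instead of per row.
sample = df.select("json_str").first()[0]
schema = spark.read.json(spark.sparkContext.parallelize([sample])).schema

parsed = df.withColumn("parsed", F.from_json("json_str", schema))  # output 1: struct column
flat = parsed.select("parsed.*")                                   # output 2: columns a, b, id

flat.show()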

Unsupported Array error when reading JDBC source in (Py)Spark?

Posted by 烈酒焚心 on 2021-02-10 06:27:50
Question: I'm trying to convert a PostgreSQL DB to a Dataframe. The following is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"

connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name")
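The excerpt ends before the error itself, but the title points at Spark's JDBC reader rejecting a PostgreSQL array column it cannot map. One common workaround, offered here only as a hedged sketch with a hypothetical table and column (neither appears in the truncated post), is to convert the array to a JDBC-friendly type inside the subquery pushed down to Postgres:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"   # placeholder, as in the question
connectionProperties = {"user": " ", "password": " ", "driver": "org.postgresql.Driver"}

# Hypothetical table "events" with an array column "tags": turn the array into
# a comma-separated string on the database side so only supported types reach Spark.
query = "(SELECT id, array_to_string(tags, ',') AS tags FROM events) AS t"

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.printSchema()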

How to obtain the average of an array-type column in scala-spark over all row entries per entry?

Posted by 一世执手 on 2021-02-10 04:57:27
Question: I have an array column with 512 double elements and want to get the average. Take an array column with length = 3 as an example:

val x = Seq("2 4 6", "0 0 0").toDF("value").withColumn("value", split($"value", " "))
x.printSchema()
x.show()

root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|    value|
+---------+
|[2, 4, 6]|
|[0, 0, 0]|
+---------+

The following result is desired:

x.select(..... as "avg_value").show()

------------
|avg_value |
------------
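The desired output is cut off, but the per-row average of an array column can be computed with the Spark SQL aggregate higher-order function (Spark 2.4+); the expression works identically from the Scala API via expr. A minimal sketch, rebuilding the toy data in PySpark for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same toy data as in the question.
x = spark.createDataFrame([("2 4 6",), ("0 0 0",)], ["value"]) \
    .withColumn("value", F.split("value", " "))

avg = x.select(
    F.expr(
        "aggregate(value, 0D, (acc, v) -> acc + cast(v as double)) / size(value)"
    ).alias("avg_value")
)

avg.show()
# avg_value: 4.0 for [2, 4, 6], 0.0 for [0, 0, 0]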

Export spark feature transformation pipeline to a file

Posted by 一世执手 on 2021-02-09 07:30:30
Question: PMML, MLeap and PFA currently only support row-based transformations; none of them support frame-based transformations like aggregates, groupBy or join. What is the recommended way to export a Spark pipeline consisting of these operations?

Answer 1: I see two options w.r.t. MLeap:

1) Implement dataframe-based transformers and the SQLTransformer-MLeap equivalent. This solution seems to be conceptually the best (since you can always encapsulate such transformations in a pipeline element), but also a lot of
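For comparison, and only as a hedged aside (the answer above is about MLeap, not Spark's own format): a frame-based step such as a groupBy aggregate can be expressed as an SQLTransformer stage and round-tripped with Spark ML's native pipeline persistence, which needs no PMML/MLeap/PFA but can only be re-read by Spark itself. The output path below is hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2.0), (1, 4.0), (2, 6.0)], ["id", "v"])

# A frame-based step (groupBy aggregate) wrapped in an SQLTransformer stage.
agg = SQLTransformer(statement="SELECT id, AVG(v) AS avg_v FROM __THIS__ GROUP BY id")

model = Pipeline(stages=[agg]).fit(df)
model.write().overwrite().save("/tmp/agg_pipeline")   # serialize the fitted pipeline
reloaded = PipelineModel.load("/tmp/agg_pipeline")    # restore it elsewhere
reloaded.transform(df).show()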

How to join two JDBC tables and avoid Exchange?

Posted by 你说的曾经没有我的故事 on 2021-02-09 03:01:10
Question: I've got an ETL-like scenario in which I read data from multiple JDBC tables and files and perform some aggregations and joins between sources. In one step I must join two JDBC tables. I've tried to do something like:

val df1 = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("dbtable", tableName)
  .option("driver", Database.DB_DRIVER)
  .option("upperBound", data.upperBound)
  .option("lowerBound", data
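The excerpt stops mid-snippet, so here is only a hedged sketch (table, column and connection names below are hypothetical, not from the post): one way to avoid an Exchange for a join between two tables in the same database is to push the join down to the database itself by reading a join subquery as the JDBC table, so Spark receives rows that are already joined.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:postgresql://host:5432/db"            # hypothetical
props = {"user": "...", "password": "...", "driver": "org.postgresql.Driver"}

# Let the database perform the join; Spark then reads the joined result and
# needs no shuffle (Exchange) of its own for this step.
joined_query = """
  (SELECT a.id, a.amount, b.name
   FROM orders a
   JOIN customers b ON a.customer_id = b.id) AS joined
"""

df = spark.read.jdbc(url=jdbc_url, table=joined_query, properties=props)
df.show()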