apache-spark-sql

pyspark. zip arrays in a dataframe

Posted by 笑着哭i on 2021-02-10 09:32:27
Question: I have the following PySpark DataFrame:

+------+----------------+
|    id|            data|
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+

At the end, I want to have the following DataFrame:

+--------+----------------------------------+
|      id|                              data|
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

In order to do this, I first extract the data arrays as
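The excerpt cuts off before the poster's own attempt, so here is a minimal sketch of one possible approach (an assumption, not the original code): collect all rows into a single array of (id, data) structs, then transpose the nested arrays with Spark 2.4+ higher-order functions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [10, 11, 12]), (2, [20, 21, 22]), (3, [30, 31, 32])],
    ["id", "data"],
)

# Gather every row into one array of structs, sorted by id so positions line up.
agg = df.groupBy().agg(
    F.sort_array(F.collect_list(F.struct("id", "data"))).alias("rows")
)

# Transpose: element k of the result holds the k-th value from every original row.
result = agg.select(
    F.col("rows.id").alias("id"),
    F.expr(
        "transform(sequence(0, size(rows[0].data) - 1), "
        "i -> transform(rows, r -> r.data[i]))"
    ).alias("data"),
)

result.show(truncate=False)
# id   -> [1, 2, 3]
# data -> [[10, 20, 30], [11, 21, 31], [12, 22, 32]]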

How to parse and transform json string from spark data frame rows in pyspark

Posted by 为君一笑 on 2021-02-10 07:57:07
Question: How to parse and transform a JSON string from Spark dataframe rows in PySpark? I'm looking for help with how to:

parse the JSON string into a JSON struct (output 1)
transform the JSON string into columns a, b and id (output 2)

Background: I get JSON strings via an API with a large number of rows (jstr1, jstr2, ...), which are saved to a Spark df. I can read the schema for each row separately, but this is not the solution, as it is very slow because of the large number of rows. Each jstr has the same schema; the columns/keys a
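The excerpt is cut off, but since every jstr shares one schema, a minimal sketch (with hypothetical sample strings, not the poster's data) is to infer the schema once from a single row and apply it to the whole column with from_json:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for jstr1, jstr2, ... from the question.
df = spark.createDataFrame(
    [('{"id": 1, "a": "x", "b": "y"}',), ('{"id": 2, "a": "p", "b": "q"}',)],
    ["json_str"],
)

# Infer the schema once, from a single sample row, instead of per row.
sample = df.select("json_str").first()[0]
schema = spark.read.json(spark.sparkContext.parallelize([sample])).schema

parsed = df.withColumn("parsed", F.from_json("json_str", schema))  # output 1: struct column
flat = parsed.select("parsed.*")                                   # output 2: columns a, b, id

flat.show()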

Unsupported Array error when reading JDBC source in (Py)Spark?

Posted by 烈酒焚心 on 2021-02-10 06:27:50
Question: I'm trying to convert a PostgreSQL DB to a Dataframe. The following is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"

connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name")
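The excerpt ends before the error itself, but the title points at Spark's JDBC reader rejecting a PostgreSQL array column it cannot map. One common workaround, offered here only as a hedged sketch with a hypothetical table and column (neither appears in the truncated post), is to convert the array to a JDBC-friendly type inside the subquery pushed down to Postgres:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"   # placeholder, as in the question
connectionProperties = {"user": " ", "password": " ", "driver": "org.postgresql.Driver"}

# Hypothetical table "events" with an array column "tags": turn the array into
# a comma-separated string on the database side so only supported types reach Spark.
query = "(SELECT id, array_to_string(tags, ',') AS tags FROM events) AS t"

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.printSchema()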

How to obtain the average of an array-type column in scala-spark over all row entries per entry?

Posted by 一世执手 on 2021-02-10 04:57:27
Question: I have an array column with 512 double elements and want to get the average. Take an array column with length = 3 as an example:

val x = Seq("2 4 6", "0 0 0").toDF("value").withColumn("value", split($"value", " "))
x.printSchema()
x.show()

root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|    value|
+---------+
|[2, 4, 6]|
|[0, 0, 0]|
+---------+

The following result is desired:

x.select(..... as "avg_value").show()

------------
|avg_value |
------------
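The desired output is cut off, but the per-row average of an array column can be computed with the Spark SQL aggregate higher-order function (Spark 2.4+); the expression works identically from the Scala API via expr. A minimal sketch, rebuilding the toy data in PySpark for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same toy data as in the question.
x = spark.createDataFrame([("2 4 6",), ("0 0 0",)], ["value"]) \
    .withColumn("value", F.split("value", " "))

avg = x.select(
    F.expr(
        "aggregate(value, 0D, (acc, v) -> acc + cast(v as double)) / size(value)"
    ).alias("avg_value")
)

avg.show()
# avg_value: 4.0 for [2, 4, 6], 0.0 for [0, 0, 0]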

Export spark feature transformation pipeline to a file

Posted by 一世执手 on 2021-02-09 07:30:30
Question: PMML, MLeap and PFA currently only support row-based transformations; none of them support frame-based transformations like aggregates, groupBy or join. What is the recommended way to export a Spark pipeline consisting of these operations?

Answer 1: I see two options w.r.t. MLeap:

1) Implement dataframe-based transformers and the SQLTransformer-MLeap equivalent. This solution seems to be conceptually the best (since you can always encapsulate such transformations in a pipeline element), but also a lot of
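For comparison, and only as a hedged aside (the answer above is about MLeap, not Spark's own format): a frame-based step such as a groupBy aggregate can be expressed as an SQLTransformer stage and round-tripped with Spark ML's native pipeline persistence, which needs no PMML/MLeap/PFA but can only be re-read by Spark itself. The output path below is hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2.0), (1, 4.0), (2, 6.0)], ["id", "v"])

# A frame-based step (groupBy aggregate) wrapped in an SQLTransformer stage.
agg = SQLTransformer(statement="SELECT id, AVG(v) AS avg_v FROM __THIS__ GROUP BY id")

model = Pipeline(stages=[agg]).fit(df)
model.write().overwrite().save("/tmp/agg_pipeline")   # serialize the fitted pipeline
reloaded = PipelineModel.load("/tmp/agg_pipeline")    # restore it elsewhere
reloaded.transform(df).show()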

How to join two JDBC tables and avoid Exchange?

Posted by 你说的曾经没有我的故事 on 2021-02-09 03:01:10
Question: I've got an ETL-like scenario in which I read data from multiple JDBC tables and files and perform some aggregations and joins between sources. In one step I must join two JDBC tables. I've tried to do something like:

val df1 = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("dbtable", tableName)
  .option("driver", Database.DB_DRIVER)
  .option("upperBound", data.upperBound)
  .option("lowerBound", data
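The excerpt stops mid-snippet, so here is only a hedged sketch (table, column and connection names below are hypothetical, not from the post): one way to avoid an Exchange for a join between two tables in the same database is to push the join down to the database itself by reading a join subquery as the JDBC table, so Spark receives rows that are already joined.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:postgresql://host:5432/db"            # hypothetical
props = {"user": "...", "password": "...", "driver": "org.postgresql.Driver"}

# Let the database perform the join; Spark then reads the joined result and
# needs no shuffle (Exchange) of its own for this step.
joined_query = """
  (SELECT a.id, a.amount, b.name
   FROM orders a
   JOIN customers b ON a.customer_id = b.id) AS joined
"""

df = spark.read.jdbc(url=jdbc_url, table=joined_query, properties=props)
df.show()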