apache-spark-sql

How to join two JDBC tables and avoid Exchange?

╄→гoц情女王★ Submitted on 2021-02-09 03:00:56

Question: I've got an ETL-like scenario in which I read data from multiple JDBC tables and files, perform some aggregations, and join the sources. In one step I must join two JDBC tables. I've tried something like:

val df1 = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("dbtable", tableName)
  .option("driver", Database.DB_DRIVER)
  .option("upperBound", data.upperBound)
  .option("lowerBound", data
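
A minimal sketch of the partitioned JDBC read described above, with placeholder URL, credentials, table name, and partition column (the real values are not shown in the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-join").getOrCreate()

// Partitioned JDBC read: partitionColumn/lowerBound/upperBound/numPartitions control
// how the source table is split into parallel reads.
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection details
  .option("user", "db_user")
  .option("password", "db_password")
  .option("dbtable", "table_one")
  .option("driver", "org.postgresql.Driver")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

Note that these options only control how the data is read; JDBC relations report no output partitioning to Spark, so a join between two such DataFrames will still plan an Exchange unless both sides are first persisted as bucketed tables or otherwise co-partitioned on the join key.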

spark - set null when column not exist in dataframe

假如想象 Submitted on 2021-02-09 02:51:04

Question: I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A,B and some hold A,B,C or A,C. If I run this command:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT A,B,C FROM table")

after loading files that do not hold column C, I get a "column not exist" error. How can I set this value to null instead of getting the error?

Answer 1: The DataFrameReader.json method provides an optional schema argument you can use
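
A minimal sketch following the hint in Answer 1 (shown in Scala, like most examples on this page): pass an explicit schema so files that lack column C still expose it, filled with nulls. Field types and the input path are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("json-schema").getOrCreate()

// Declare all expected columns up front; missing ones come back as null.
val schema = StructType(Seq(
  StructField("A", StringType, nullable = true),
  StructField("B", StringType, nullable = true),
  StructField("C", StringType, nullable = true)
))

val df = spark.read.schema(schema).json("/path/to/json/files")
df.createOrReplaceTempView("table")
spark.sql("SELECT A, B, C FROM table").show()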

How to sort a column with Date and time values in Spark?

假如想象 Submitted on 2021-02-08 15:12:08

Question: Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer 1: As this format is not standard, you need to use the unix_timestamp function to parse the string and
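
A minimal sketch of that approach, assuming an illustrative column name "event_time" and the legacy SimpleDateFormat-based parser of Spark 2.x (on Spark 3.x the pattern would need adjusting or spark.sql.legacy.timeParserPolicy=LEGACY):

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Parse the non-standard string into a real timestamp, then sort by it.
val sorted = df
  .withColumn("ts", unix_timestamp(col("event_time"), "dd-MMM-yy hh.mm.ss.SSSSSSSSS a").cast("timestamp"))
  .orderBy(col("ts"))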

Getting error saying “Queries with streaming sources must be executed with writeStream.start()” on spark structured streaming [duplicate]

拟墨画扇 Submitted on 2021-02-08 12:00:31

Question: This question already has answers here: How to display a streaming DataFrame (as show fails with AnalysisException)? (2 answers). Closed 2 years ago. I am getting some issues while executing Spark SQL on top of Spark Structured Streaming. PFA for the error. Here is my code:

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work
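
The error in the title is raised whenever an eager action such as show() or collect() is called on a streaming DataFrame. A minimal sketch of the usual fix, where "streamingDf" stands for whatever streaming DataFrame the original code builds: start a streaming query instead of calling show().

// Instead of streamingDf.show(), start a streaming query; the console sink prints
// each micro-batch, which is roughly what show() was meant to do.
val query = streamingDf.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()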

Array of struct parsing in Spark dataframe

丶灬走出姿态 Submitted on 2021-02-08 11:54:14

Question: I have a DataFrame with one struct-type column. The sample DataFrame schema is:

root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

The field name holds the column name and the field value holds the column value. The number of elements in the Data column is not defined, so it can vary. I need to parse that data and get rid of the nested structure. (Array explode will not work in this case because data in one row
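
One way to flatten such name/value pairs, sketched here assuming Spark 2.4+ (where map_from_entries is available) and illustrative key names: convert the array of structs into a map, then pull individual names out as top-level columns.

import org.apache.spark.sql.functions.{col, map_from_entries}

// Turn the array of {name, value} structs into a map keyed by name.
val withMap = df.withColumn("kv", map_from_entries(col("Data")))

// "metric_a" and "metric_b" are illustrative keys; rows missing a key simply yield null.
val flattened = withMap
  .withColumn("metric_a", col("kv").getItem("metric_a"))
  .withColumn("metric_b", col("kv").getItem("metric_b"))
  .drop("kv", "Data")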

scala - how to substring column names after the last dot?

倖福魔咒の Submitted on 2021-02-08 11:27:34

Question: After exploding a nested structure I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When performing a select I'm getting the error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so the column names are parsed correctly? I've tried the following: the substrings after the dots are extracted successfully. But
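
A minimal sketch of one way to handle this: either escape a dotted name with backticks when selecting, or rename every column to the substring after its last dot so plain selects work again (this assumes no two columns collapse to the same short name):

// A single dotted column can be selected directly by escaping it with backticks:
// df.select("`sales_data.metric1`")

// Or rename every column to the part after its last dot:
val renamed = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.substring(name.lastIndexOf('.') + 1))
}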

Creating a UDF function with a non-primitive data type and using it in a Spark SQL query: Scala

梦想与她 Submitted on 2021-02-08 11:00:42

Question: I am creating a function in Scala which I want to use in my Spark SQL query. My query works fine in Hive, and also if I run the same query in Spark SQL, but I'm using the same query in multiple places, so I want to create it as a reusable function/method that I can just call whenever it's required. I have created the function below in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) //give value as 01
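
A Column => Column helper like date_part can be called directly from the DataFrame API, but to use it inside SQL text it has to be registered as a UDF over plain values. A minimal sketch of that registration, with an assumed function name and a simplified month-extraction body:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-example").master("local[*]").getOrCreate()

// Register a UDF over plain String values; the name "date_part_month" is illustrative.
spark.udf.register("date_part_month", (date: String) => {
  val fmt = new java.text.SimpleDateFormat("dd-MM-yyyy")
  val cal = java.util.Calendar.getInstance()
  cal.setTime(fmt.parse(date))
  f"${cal.get(java.util.Calendar.MONTH) + 1}%02d"   // Calendar months are 0-based; pad to "01".."12"
})

// It can then be used in SQL text, e.g.:
// spark.sql("SELECT date_part_month(order_date) FROM orders")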