apache-spark-sql

How to join two JDBC tables and avoid Exchange?

╄→гoц情女王★ Submitted on 2021-02-09 03:00:56

Question: I've got an ETL-like scenario in which I read data from multiple JDBC tables and files, perform some aggregations, and join the sources. In one step I must join two JDBC tables. I've tried something like:

val df1 = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("dbtable", tableName)
  .option("driver", Database.DB_DRIVER)
  .option("upperBound", data.upperBound)
  .option("lowerBound", data
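
A minimal sketch of the partitioned JDBC read described above, with placeholder URL, credentials, table name, and partition column (the real values are not shown in the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-join").getOrCreate()

// Partitioned JDBC read: partitionColumn/lowerBound/upperBound/numPartitions control
// how the source table is split into parallel reads.
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection details
  .option("user", "db_user")
  .option("password", "db_password")
  .option("dbtable", "table_one")
  .option("driver", "org.postgresql.Driver")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

Note that these options only control how the data is read; JDBC relations report no output partitioning to Spark, so a join between two such DataFrames will still plan an Exchange unless both sides are first persisted as bucketed tables or otherwise co-partitioned on the join key.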

spark - set null when column not exist in dataframe

假如想象 Submitted on 2021-02-09 02:51:04

Question: I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A,B and some hold A,B,C or A,C. If I run this command:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT A,B,C FROM table")

after loading files that do not hold column C, I get a "column not exist" error. How can I set this value to null instead of getting the error?

Answer 1: The DataFrameReader.json method provides an optional schema argument you can use
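
A minimal sketch following the hint in Answer 1 (shown in Scala, like most examples on this page): pass an explicit schema so files that lack column C still expose it, filled with nulls. Field types and the input path are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("json-schema").getOrCreate()

// Declare all expected columns up front; missing ones come back as null.
val schema = StructType(Seq(
  StructField("A", StringType, nullable = true),
  StructField("B", StringType, nullable = true),
  StructField("C", StringType, nullable = true)
))

val df = spark.read.schema(schema).json("/path/to/json/files")
df.createOrReplaceTempView("table")
spark.sql("SELECT A, B, C FROM table").show()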

How to sort a column with Date and time values in Spark?

假如想象 Submitted on 2021-02-08 15:12:08

Question: Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer 1: As this format is not standard, you need to use the unix_timestamp function to parse the string and
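
A minimal sketch of that approach, assuming an illustrative column name "event_time" and the legacy SimpleDateFormat-based parser of Spark 2.x (on Spark 3.x the pattern would need adjusting or spark.sql.legacy.timeParserPolicy=LEGACY):

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Parse the non-standard string into a real timestamp, then sort by it.
val sorted = df
  .withColumn("ts", unix_timestamp(col("event_time"), "dd-MMM-yy hh.mm.ss.SSSSSSSSS a").cast("timestamp"))
  .orderBy(col("ts"))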

Getting error saying “Queries with streaming sources must be executed with writeStream.start()” on spark structured streaming [duplicate]

拟墨画扇 Submitted on 2021-02-08 12:00:31

Question: This question already has answers here: How to display a streaming DataFrame (as show fails with AnalysisException)? (2 answers). Closed 2 years ago. I am getting some issues while executing Spark SQL on top of Spark Structured Streaming. PFA for the error. Here is my code:

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work
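
The error in the title is raised whenever an eager action such as show() or collect() is called on a streaming DataFrame. A minimal sketch of the usual fix, where "streamingDf" stands for whatever streaming DataFrame the original code builds: start a streaming query instead of calling show().

// Instead of streamingDf.show(), start a streaming query; the console sink prints
// each micro-batch, which is roughly what show() was meant to do.
val query = streamingDf.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()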

Array of struct parsing in Spark dataframe

丶灬走出姿态 Submitted on 2021-02-08 11:54:14

Question: I have a DataFrame with one struct-type column. The sample DataFrame schema is:

root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

The field name holds the column name and the field value holds the column value. The number of elements in the Data column is not defined, so it can vary. I need to parse that data and get rid of the nested structure. (Array explode will not work in this case because data in one row
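
One way to flatten such name/value pairs, sketched here assuming Spark 2.4+ (where map_from_entries is available) and illustrative key names: convert the array of structs into a map, then pull individual names out as top-level columns.

import org.apache.spark.sql.functions.{col, map_from_entries}

// Turn the array of {name, value} structs into a map keyed by name.
val withMap = df.withColumn("kv", map_from_entries(col("Data")))

// "metric_a" and "metric_b" are illustrative keys; rows missing a key simply yield null.
val flattened = withMap
  .withColumn("metric_a", col("kv").getItem("metric_a"))
  .withColumn("metric_b", col("kv").getItem("metric_b"))
  .drop("kv", "Data")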

scala - how to substring column names after the last dot?

倖福魔咒の Submitted on 2021-02-08 11:27:34

Question: After exploding a nested structure I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When performing a select I'm getting the error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so the column names are parsed correctly? I've tried the following: the substrings after the dots are extracted successfully. But
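
A minimal sketch of one way to handle this: either escape a dotted name with backticks when selecting, or rename every column to the substring after its last dot so plain selects work again (this assumes no two columns collapse to the same short name):

// A single dotted column can be selected directly by escaping it with backticks:
// df.select("`sales_data.metric1`")

// Or rename every column to the part after its last dot:
val renamed = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.substring(name.lastIndexOf('.') + 1))
}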

Creating a UDF function with a non-primitive data type and using it in a Spark SQL query: Scala

梦想与她 Submitted on 2021-02-08 11:00:42

Question: I am creating a function in Scala which I want to use in my Spark SQL query. My query works fine in Hive, and also if I run the same query in Spark SQL, but I'm using the same query in multiple places, so I want to create it as a reusable function/method that I can just call whenever it's required. I have created the function below in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) //give value as 01
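
A Column => Column helper like date_part can be called directly from the DataFrame API, but to use it inside SQL text it has to be registered as a UDF over plain values. A minimal sketch of that registration, with an assumed function name and a simplified month-extraction body:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-example").master("local[*]").getOrCreate()

// Register a UDF over plain String values; the name "date_part_month" is illustrative.
spark.udf.register("date_part_month", (date: String) => {
  val fmt = new java.text.SimpleDateFormat("dd-MM-yyyy")
  val cal = java.util.Calendar.getInstance()
  cal.setTime(fmt.parse(date))
  f"${cal.get(java.util.Calendar.MONTH) + 1}%02d"   // Calendar months are 0-based; pad to "01".."12"
})

// It can then be used in SQL text, e.g.:
// spark.sql("SELECT date_part_month(order_date) FROM orders")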