apache-spark-sql | 易学教程

PySpark explode list into multiple columns based on name

Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

    AA 1234 ZXYW
    BB A 890
    CC B 321
    AA 1234 LMNO
    BB D 123
    CC E 321
    AA 1234 ZXYW
    CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record

Spark Data set transformation to array [duplicate]

Question: This question already has answers here: How to aggregate values into collection after groupBy? (3 answers) Closed 8 months ago. I have a dataset like the one below, with values of col1 repeating multiple times and unique values of col2. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list, as it will not scale out for my use case. Original Dataset:

    +------+------+
    | col1 | col2 |
    +------+------+
    | AA   | 11   |
    | BB   | 21   |
    | AA   | 12   |
    | AA   | 13   |
    |

Computing First Day of Previous Quarter in Spark SQL

Question: How do I derive the first day of the previous quarter for any given date in a Spark SQL query, using the SQL API? A few sample inputs with the required outputs:

    input_date | start_date
    -----------+-----------
    2020-01-21 | 2019-10-01
    2020-02-06 | 2019-10-01
    2020-04-15 | 2020-01-01
    2020-07-10 | 2020-04-01
    2020-10-20 | 2020-07-01
    2021-02-04 | 2020-10-01

The quarters are:

    1 | Jan - Mar
    2 | Apr - Jun
    3 | Jul - Sep
    4 | Oct - Dec

Note: I am using Spark SQL v2.4. Any help is appreciated. Thanks.
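The date arithmetic the question asks for can be sketched in plain Python and checked against the sample table above. This is only a reference for the logic, not a Spark answer: the function name `first_day_of_previous_quarter` is hypothetical, and the Spark SQL expression held in `PREVIOUS_QUARTER_SQL` is an untested assumption built from the `date_trunc`, `to_date`, and `add_months` functions, which should be verified on a Spark 2.4 cluster before use.

```python
from datetime import date


def first_day_of_previous_quarter(d: date) -> date:
    """Return the first day of the quarter before the one containing d."""
    # Zero-based index of the current quarter: Jan-Mar -> 0, ..., Oct-Dec -> 3.
    current_quarter = (d.month - 1) // 3
    if current_quarter == 0:
        # A date in Q1 rolls back to Q4 of the previous year.
        return date(d.year - 1, 10, 1)
    # First month of the previous quarter is 1, 4, or 7.
    return date(d.year, 3 * (current_quarter - 1) + 1, 1)


# Assumed Spark SQL equivalent (not executed here): truncate to the start of
# the current quarter, then step back three months.
PREVIOUS_QUARTER_SQL = "add_months(to_date(date_trunc('quarter', input_date)), -3)"

# The sample rows from the question, as (input_date, expected start_date).
SAMPLES = [
    (date(2020, 1, 21), date(2019, 10, 1)),
    (date(2020, 2, 6), date(2019, 10, 1)),
    (date(2020, 4, 15), date(2020, 1, 1)),
    (date(2020, 7, 10), date(2020, 4, 1)),
    (date(2020, 10, 20), date(2020, 7, 1)),
    (date(2021, 2, 4), date(2020, 10, 1)),
]

for input_date, expected in SAMPLES:
    assert first_day_of_previous_quarter(input_date) == expected
```

If the SQL form holds up, it could be used directly in a query such as `SELECT input_date, <expression> AS start_date FROM ...`, with the Python function serving as an independent check of the expected values.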

How can I make the pyspark and SparkSQL to execute the Hive on Spark?

Question: I've installed and set up Spark on YARN and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and managed to create a Hive table, load data, and then select from it properly. Then I moved to the next step, setting up Hive on Spark. Using hive / beeline, I also managed to create a Hive table, load data, and then select from it properly. Hive is executed on YARN/Spark properly. How do I know it works? The hive shell displays the following: - hive> select

Converting dataframe to dictionary in pyspark without using pandas

Question: Following up on this question and dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this: dictionary = df_2.unstack().to_dict(orient='index') However, I need to convert this code to PySpark. Can anyone help me with this? As I understand from previous questions such as this, I would indeed need to use pandas, but the dataframe is way too big for me to be able to do this. How can I solve this? EDIT: I have now tried the following approach: dictionary_list = map

spark sql lag, result gets different rows when I change column

Question: I'm trying to lag a field when it matches certain conditions. Because I need to use filters, I'm using the MAX function to lag it, as the LAG function itself doesn't work the way I need. I have been able to do it with the code below for ID_EVENT_LOG, but when I change ID_EVENT_LOG inside the MAX to the column ENSAIO, so that I lag ENSAIO instead, it doesn't work properly. Example below. Dataset:

    +------------+---------+------+
    |ID_EVENT_LOG|ID_PAINEL|ENSAIO|
    +------------+

Spark Scala Checkpointing Data Set showing .isCheckpointed = false after Action but directories written

Question: There seem to be a few postings on this, but none seems to answer it in a way I understand. The following code was run on Databricks:

    spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
    val checkpointDir = spark.sparkContext.getCheckpointDir.get
    val ds = spark.range(10).repartition(2)
    ds.cache()
    ds.checkpoint()
    ds.count()
    ds.rdd.isCheckpointed

Added an improvement of sorts:

    ...
    val ds2 = ds.checkpoint(eager=true)
    println(ds2.queryExecution.toRdd.toDebugString)
    ...

returns: (2)

How to parse dynamic Json with dynamic keys inside it in Scala

Question: I am trying to parse a JSON structure that is dynamic in nature and load it into a database, but I am facing difficulty where the JSON has dynamic keys inside it. I have tried using the explode function, but it didn't help. A mostly similar problem is described here: How to parse a dynamic JSON key in a Nested JSON result? Below is my sample JSON: { "_id": { "planId": "5f34dab0c661d8337097afb9", "version": { "$numberLong": "1" }, "period": { "name" : "3Q20", "startDate": 20200629, "endDate": 20200927 }, "line": "b443e9c0