apache-spark-sql

Count of values in a row in spark dataframe using scala

笑着哭i posted on 2021-02-11 12:52:24
Question: I have a dataframe containing the amount of sales for different items across different sales outlets. The dataframe shown below only includes a few of the items and outlets. There is a benchmark of 100 units sold per day for each item: each item that sold more than 100 is marked "Yes", and each item below 100 is marked "No".

val df1 = Seq(
  ("Mumbai", 90, 109, , 101, 78, ............., "No", "Yes", "Yes", "No", .....),
  ("Singapore", 149, 129, , 201, 107, ............., "Yes
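The post is truncated before the question itself, but counting per-row "Yes" flags is usually done by summing conditional expressions across the flag columns. A minimal sketch, assuming the Seq above has been turned into a DataFrame df and using hypothetical flag column names (the real ones are cut off):

import org.apache.spark.sql.functions._

// Hypothetical flag column names; the real ones are elided in the post.
val flagCols = Seq("item1Flag", "item2Flag", "item3Flag")

// One 0/1 expression per flag column, summed into a per-row count of "Yes".
val yesCount = flagCols
  .map(c => when(col(c) === "Yes", 1).otherwise(0))
  .reduce(_ + _)

val result = df.withColumn("yesCount", yesCount)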

SQL or Pyspark - Get the last time a column had a different value for each ID

爱⌒轻易说出口 posted on 2021-02-11 12:14:05
Question: I am using pyspark, so I have tried both pyspark code and SQL. I am trying to get, for each row, the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
|  1|      1|      A|  10|
|  2|      1|      B|  15|
|  3|      1|      A|  20|
|  4|      1|      A|  40|
|  5|      1|      A|  45|
+---+-------+-------+----+

The correct new column I would like is as below:

+---+-------+-------+----+---------+
| ID|USER_ID
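The post is cut off before the expected output, but a common way to get this is a window ordered by TIME: record the previous row's TIME whenever ADDRESS changes, then carry that value forward. A sketch of that idea (in Scala; the post asks for PySpark or SQL, and the same DataFrame calls exist in pyspark):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("USER_ID").orderBy("TIME")

// TIME of the previous row, but only when the ADDRESS just changed
val changeTime = when(lag("ADDRESS", 1).over(w) =!= col("ADDRESS"), lag("TIME", 1).over(w))

// Carry the most recent change time forward within each USER_ID
val result = df
  .withColumn("changeTime", changeTime)
  .withColumn("lastDifferentTime",
    last("changeTime", ignoreNulls = true)
      .over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))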

Spark-shell : The number of columns doesn't match

岁酱吖の posted on 2021-02-11 07:44:22
Question: I have a CSV-format file separated by the pipe delimiter "|". The dataset has 2 columns, like below:

Column1|Column2
1|Name_a
2|Name_b

But sometimes we receive only one column value and the other is missing, like below:

Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f

Any row with a mismatched number of columns is a garbage value for us; in the example above those are the rows with column values 3, 4, and 6, and we want to discard them. Is there any direct way I can discard those
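The post is truncated, but the CSV reader's DROPMALFORMED mode is one common way to drop rows whose token count does not match the schema. A sketch, with a hypothetical file path:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Column1", IntegerType),
  StructField("Column2", StringType)))

val df = spark.read
  .option("sep", "|")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // rows like "3", "4", "6" are dropped as malformed
  .schema(schema)
  .csv("/path/to/data.csv")          // hypothetical path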

Create dataframe with schema provided as JSON file

戏子无情 posted on 2021-02-11 01:56:22
Question: How can I create a pyspark data frame from 2 JSON files? file1 has the complete data; file2 has only the schema of file1's data.

file1:
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

file2:
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},
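The schema file is cut off, but if it holds (or can be reshaped into) the JSON form that StructType#json produces, it can be turned back into a schema and applied while reading file1. A sketch in Scala (the post asks for PySpark, where StructType.fromJson plays the same role); the paths are assumptions:

import org.apache.spark.sql.types.{DataType, StructType}

// Read the schema file as plain text and rebuild the StructType from it.
// DataType.fromJson expects {"type":"struct","fields":[...]}, so the posted
// file may need minor reshaping into that form first.
val schemaJson = spark.read.textFile("file2.json").collect().mkString
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Apply the schema while reading the data file.
val df = spark.read.schema(schema).json("file1.json")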

Count distinct in window functions

天涯浪子 posted on 2021-02-10 20:26:49
Question: I was trying to count the unique values of column b for each c, without doing a group by. I know this could be done with a join; how can I do count(distinct b) over (partition by c) without resorting to a join? And why is count distinct not supported in window functions? Thank you in advance. Given this data frame:

val df = Seq(("a1","b1","c1"), ("a2","b2","c1"), ("a3","b3","c1"), ("a31",null,"c1"),
  ("a32",null,"c1"), ("a4","b4","c11"), ("a5","b5","c11"), ("a6","b6","c11"),
  ("a7","b1","c2"), ("a8","b1","c3"), ("a9
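The post is truncated, but since Spark SQL rejects COUNT(DISTINCT ...) over a window, two workarounds are commonly used: the size of collect_set, or approx_count_distinct. A sketch, assuming the columns are named a, b, c (e.g. via toDF("a","b","c")):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("c")

// Exact: number of distinct non-null values of b within each partition of c
val exact = df.withColumn("distinctB", size(collect_set(col("b")).over(w)))

// Approximate, cheaper on high-cardinality columns
val approx = df.withColumn("distinctB", approx_count_distinct(col("b")).over(w))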

Cosine Similarity for two pyspark dataframes

穿精又带淫゛_ posted on 2021-02-10 19:31:21
Question: I have a PySpark DataFrame, df1, that looks like:

CustomerID CustomerValue CustomerValue2
12         .17           .08

I have a second PySpark DataFrame, df2:

CustomerID CustomerValue CustomerValue
15         .17           .14
16         .40           .43
18         .86           .09

I want to take the cosine similarity of the two dataframes and have something like this:

CustomerID CustomerID CosineCustVal CosineCustVal
15         12         1             .90
16         12         .45           .67
18         12         .8            .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the
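The answer is cut off, but treating each customer's two value columns as a vector, pairwise cosine similarity can be computed with a cross join. A sketch, assuming df2's second value column is actually CustomerValue2 and using an output column name of my own choosing:

import org.apache.spark.sql.functions._

val a = df1.select(col("CustomerID").as("id1"),
                   col("CustomerValue").as("x1"), col("CustomerValue2").as("y1"))
val b = df2.select(col("CustomerID").as("id2"),
                   col("CustomerValue").as("x2"), col("CustomerValue2").as("y2"))

// Dot product divided by the product of the vector norms, per customer pair.
val cosine = a.crossJoin(b).withColumn(
  "cosineSim",
  (col("x1") * col("x2") + col("y1") * col("y2")) /
    (sqrt(col("x1") * col("x1") + col("y1") * col("y1")) *
     sqrt(col("x2") * col("x2") + col("y2") * col("y2"))))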

Spark: subtract dataframes but preserve duplicate values

南笙酒味 posted on 2021-02-10 14:51:08
Question: Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A. I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A. As a conceptual example, if I have two dataframes:

words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in
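The post is truncated before the expected output, but Spark 2.4+ has exceptAll, which keeps duplicates from the left side; for a stop-word-style filter like this one, a left_anti join gives the same result on older versions. A sketch (the column name is an assumption, and a SparkSession named spark is assumed to be in scope, as in spark-shell):

import spark.implicits._

val words     = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("word")
val stopWords = Seq("the", "a").toDF("word")

// Multiset difference: keeps both "fox" rows
val kept = words.exceptAll(stopWords)                  // Spark 2.4+

// Equivalent here on older versions
val keptAnti = words.join(stopWords, Seq("word"), "left_anti")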

How can I configure spark so that it creates “_$folder$” entries in S3?

青春壹個敷衍的年華 posted on 2021-02-10 14:39:47
Question: When I write my dataframe to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3:

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts that read from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure spark/hadoop to generate
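The post is truncated, and as far as I know the _$folder$ markers are written by the older s3n/EMRFS connectors rather than controlled by a Spark option, so one workaround is to create the zero-byte marker objects yourself after the write. A sketch, reusing the post's placeholder bucket path and a hard-coded partition list purely for illustration (scheme handling depends on which S3 connector is configured):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val base = "s3://xxxx/yyyy"
val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)

// Create an empty "<partition>_$folder$" object next to each partition directory.
Seq("year=2018", "year=2019").foreach { part =>
  fs.create(new Path(s"$base/${part}_$$folder$$")).close()
}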