apache-spark-sql

Count of values in a row in spark dataframe using scala

笑着哭i posted on 2021-02-11 12:52:24
Question: I have a dataframe containing the amount of sales for different items across different sales outlets. The dataframe shown below only includes a few of the items and outlets. There is a benchmark of 100 units sold per day for each item: each item that sold more than 100 is marked "Yes", and each item below 100 is marked "No".

val df1 = Seq(
  ("Mumbai", 90, 109, , 101, 78, ............., "No", "Yes", "Yes", "No", .....),
  ("Singapore", 149, 129, , 201, 107, ............., "Yes
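The post is truncated before the question itself, but counting per-row "Yes" flags is usually done by summing conditional expressions across the flag columns. A minimal sketch, assuming the Seq above has been turned into a DataFrame df and using hypothetical flag column names (the real ones are cut off):

import org.apache.spark.sql.functions._

// Hypothetical flag column names; the real ones are elided in the post.
val flagCols = Seq("item1Flag", "item2Flag", "item3Flag")

// One 0/1 expression per flag column, summed into a per-row count of "Yes".
val yesCount = flagCols
  .map(c => when(col(c) === "Yes", 1).otherwise(0))
  .reduce(_ + _)

val result = df.withColumn("yesCount", yesCount)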

SQL or Pyspark - Get the last time a column had a different value for each ID

爱⌒轻易说出口 posted on 2021-02-11 12:14:05
Question: I am using pyspark, so I have tried both pyspark code and SQL. I am trying to get, for each row, the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
|  1|      1|      A|  10|
|  2|      1|      B|  15|
|  3|      1|      A|  20|
|  4|      1|      A|  40|
|  5|      1|      A|  45|
+---+-------+-------+----+

The correct new column I would like is as below:

+---+-------+-------+----+---------+
| ID|USER_ID
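The post is cut off before the expected output, but a common way to get this is a window ordered by TIME: record the previous row's TIME whenever ADDRESS changes, then carry that value forward. A sketch of that idea (in Scala; the post asks for PySpark or SQL, and the same DataFrame calls exist in pyspark):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("USER_ID").orderBy("TIME")

// TIME of the previous row, but only when the ADDRESS just changed
val changeTime = when(lag("ADDRESS", 1).over(w) =!= col("ADDRESS"), lag("TIME", 1).over(w))

// Carry the most recent change time forward within each USER_ID
val result = df
  .withColumn("changeTime", changeTime)
  .withColumn("lastDifferentTime",
    last("changeTime", ignoreNulls = true)
      .over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))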

Spark-shell : The number of columns doesn't match

岁酱吖の posted on 2021-02-11 07:44:22
Question: I have a CSV-format file separated by the pipe delimiter "|". The dataset has 2 columns, like below:

Column1|Column2
1|Name_a
2|Name_b

But sometimes we receive only one column value and the other is missing, like below:

Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f

Any row with a mismatched number of columns is a garbage value for us; in the example above those are the rows with column values 3, 4, and 6, and we want to discard them. Is there any direct way I can discard those
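The post is truncated, but the CSV reader's DROPMALFORMED mode is one common way to drop rows whose token count does not match the schema. A sketch, with a hypothetical file path:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Column1", IntegerType),
  StructField("Column2", StringType)))

val df = spark.read
  .option("sep", "|")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // rows like "3", "4", "6" are dropped as malformed
  .schema(schema)
  .csv("/path/to/data.csv")          // hypothetical path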

Create dataframe with schema provided as JSON file

戏子无情 posted on 2021-02-11 01:56:22
Question: How can I create a pyspark data frame from 2 JSON files? file1 has the complete data; file2 has only the schema of file1's data.

file1:
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

file2:
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},
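The schema file is cut off, but if it holds (or can be reshaped into) the JSON form that StructType#json produces, it can be turned back into a schema and applied while reading file1. A sketch in Scala (the post asks for PySpark, where StructType.fromJson plays the same role); the paths are assumptions:

import org.apache.spark.sql.types.{DataType, StructType}

// Read the schema file as plain text and rebuild the StructType from it.
// DataType.fromJson expects {"type":"struct","fields":[...]}, so the posted
// file may need minor reshaping into that form first.
val schemaJson = spark.read.textFile("file2.json").collect().mkString
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Apply the schema while reading the data file.
val df = spark.read.schema(schema).json("file1.json")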

Count distinct in window functions

天涯浪子 posted on 2021-02-10 20:26:49
Question: I was trying to count the unique values of column b for each c, without doing a group by. I know this could be done with a join; how can I do count(distinct b) over (partition by c) without resorting to a join? And why is count distinct not supported in window functions? Thank you in advance. Given this data frame:

val df = Seq(("a1","b1","c1"), ("a2","b2","c1"), ("a3","b3","c1"), ("a31",null,"c1"),
  ("a32",null,"c1"), ("a4","b4","c11"), ("a5","b5","c11"), ("a6","b6","c11"),
  ("a7","b1","c2"), ("a8","b1","c3"), ("a9
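The post is truncated, but since Spark SQL rejects COUNT(DISTINCT ...) over a window, two workarounds are commonly used: the size of collect_set, or approx_count_distinct. A sketch, assuming the columns are named a, b, c (e.g. via toDF("a","b","c")):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("c")

// Exact: number of distinct non-null values of b within each partition of c
val exact = df.withColumn("distinctB", size(collect_set(col("b")).over(w)))

// Approximate, cheaper on high-cardinality columns
val approx = df.withColumn("distinctB", approx_count_distinct(col("b")).over(w))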

Cosine Similarity for two pyspark dataframes

穿精又带淫゛_ posted on 2021-02-10 19:31:21
Question: I have a PySpark DataFrame, df1, that looks like:

CustomerID CustomerValue CustomerValue2
12         .17           .08

I have a second PySpark DataFrame, df2:

CustomerID CustomerValue CustomerValue
15         .17           .14
16         .40           .43
18         .86           .09

I want to take the cosine similarity of the two dataframes and have something like this:

CustomerID CustomerID CosineCustVal CosineCustVal
15         12         1             .90
16         12         .45           .67
18         12         .8            .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the
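The answer is cut off, but treating each customer's two value columns as a vector, pairwise cosine similarity can be computed with a cross join. A sketch, assuming df2's second value column is actually CustomerValue2 and using an output column name of my own choosing:

import org.apache.spark.sql.functions._

val a = df1.select(col("CustomerID").as("id1"),
                   col("CustomerValue").as("x1"), col("CustomerValue2").as("y1"))
val b = df2.select(col("CustomerID").as("id2"),
                   col("CustomerValue").as("x2"), col("CustomerValue2").as("y2"))

// Dot product divided by the product of the vector norms, per customer pair.
val cosine = a.crossJoin(b).withColumn(
  "cosineSim",
  (col("x1") * col("x2") + col("y1") * col("y2")) /
    (sqrt(col("x1") * col("x1") + col("y1") * col("y1")) *
     sqrt(col("x2") * col("x2") + col("y2") * col("y2"))))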

Spark: subtract dataframes but preserve duplicate values

南笙酒味 posted on 2021-02-10 14:51:08
Question: Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A. I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A. As a conceptual example, if I have two dataframes:

words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in
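The post is truncated before the expected output, but Spark 2.4+ has exceptAll, which keeps duplicates from the left side; for a stop-word-style filter like this one, a left_anti join gives the same result on older versions. A sketch (the column name is an assumption, and a SparkSession named spark is assumed to be in scope, as in spark-shell):

import spark.implicits._

val words     = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("word")
val stopWords = Seq("the", "a").toDF("word")

// Multiset difference: keeps both "fox" rows
val kept = words.exceptAll(stopWords)                  // Spark 2.4+

// Equivalent here on older versions
val keptAnti = words.join(stopWords, Seq("word"), "left_anti")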

How can I configure spark so that it creates “_$folder$” entries in S3?

青春壹個敷衍的年華 posted on 2021-02-10 14:39:47
Question: When I write my dataframe to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3:

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts that read from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure spark/hadoop to generate
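The post is truncated, and as far as I know the _$folder$ markers are written by the older s3n/EMRFS connectors rather than controlled by a Spark option, so one workaround is to create the zero-byte marker objects yourself after the write. A sketch, reusing the post's placeholder bucket path and a hard-coded partition list purely for illustration (scheme handling depends on which S3 connector is configured):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val base = "s3://xxxx/yyyy"
val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)

// Create an empty "<partition>_$folder$" object next to each partition directory.
Seq("year=2018", "year=2019").foreach { part =>
  fs.create(new Path(s"$base/${part}_$$folder$$")).close()
}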