spark-dataframe

How to filter one spark dataframe against another dataframe

谁说我不能喝 submitted on 2020-01-01 04:19:10
Question: I'm trying to filter one dataframe against another:

scala> val df1 = sc.parallelize((1 to 100).map(a=>(s"user $a", a*0.123, a))).toDF("name", "score", "user_id")
scala> val df2 = sc.parallelize(List(2,3,4,5,6)).toDF("valid_id")

Now I want to filter df1 and get back a dataframe that contains all the rows in df1 where user_id is in df2("valid_id"). In other words, I want all the rows in df1 where the user_id is either 2, 3, 4, 5 or 6:

scala> df1.select("user_id").filter($"user_id" in df2("valid_id"
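
One common way to express this kind of filter is a left-semi join, which keeps only the rows of df1 that have a match in df2 and returns df1's columns only. A minimal sketch, assuming df1 and df2 as defined above:

```scala
// Minimal sketch: a left_semi join keeps only the rows of df1 whose user_id
// appears in df2.valid_id, and returns df1's columns only.
val filtered = df1.join(df2, df1("user_id") === df2("valid_id"), "left_semi")
filtered.show()
```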

“resolved attribute(s) missing” when performing join on pySpark

假装没事ソ submitted on 2020-01-01 02:10:09
Question: I have the following two pySpark dataframes:

> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']
> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']

Now I want to join them on common columns, so I try the following:

> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])

And I get the following error message:

AnalysisException: u'resolved attribute(s) price#3424
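
This error usually appears when the two dataframes share lineage (one was derived from the other), so some attributes carry IDs that can no longer be resolved. A commonly suggested workaround is to re-project each side through fresh aliases before joining; another frequent fallback is rebuilding one side with spark.createDataFrame(df.rdd, df.schema). A sketch of the aliasing idea, shown in Scala although the question uses PySpark, with the dataframe and column names taken from the question:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: re-project both sides through fresh column aliases so the
// join columns no longer reference attribute IDs from the other dataframe's lineage.
val left  = df_lag_pre.select(df_lag_pre.columns.map(c => col(c).alias(c)): _*)
val right = df_unmatched.select(df_unmatched.columns.map(c => col(c).alias(c)): _*)

val joined = left.join(right, Seq("name", "country", "ccy_code", "usd_price"))
```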

Spark ML StringIndexer Different Labels Training/Testing

拟墨画扇 submitted on 2019-12-31 03:40:08
Question: I'm using Scala and am using StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category. The problem is that in my testing data the frequencies of the categories are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly. I am processing the training/testing data in exactly the same way, and don't save the model. I have tried manually
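
The usual remedy is to fit the StringIndexer once, on the training data (or on the union of both sets), and reuse the fitted StringIndexerModel on the test data, so every category keeps the same index; wrapping the indexer in a Pipeline achieves the same thing. A minimal sketch, where trainingDF, testDF and the column name "category" are assumptions:

```scala
import org.apache.spark.ml.feature.StringIndexer

// Hypothetical sketch: fit the indexer once on the training data and reuse the fitted
// model on the test data, so a given category always maps to the same index.
// trainingDF, testDF and the column name "category" are assumed names.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")   // categories unseen at fit time get their own index instead of failing (Spark 2.2+)

val indexerModel = indexer.fit(trainingDF)
val trainIndexed = indexerModel.transform(trainingDF)
val testIndexed  = indexerModel.transform(testDF)
```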

Pyspark Merge WrappedArrays Within a Dataframe

只谈情不闲聊 submitted on 2019-12-31 03:06:05
Question: The current Pyspark dataframe has this structure (a list of WrappedArrays for col2):

+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)]|
+---+-------------------------------------------------+
|b  |[WrappedArray(code5), WrappedArray(code6, code8)]|
+---+------------------------------------------------
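
In Spark 2.4 and later the built-in flatten function merges an array of arrays into a single array (on older versions a UDF or explode plus collect_list is needed). A sketch in Scala, although the question uses PySpark; the dataframe name df is an assumption:

```scala
import org.apache.spark.sql.functions.{col, flatten}

// Hypothetical sketch: flatten (Spark 2.4+) merges array<array<string>> into array<string>.
val merged = df.withColumn("col2_merged", flatten(col("col2")))
merged.show(false)
```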

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among “ ,;{}()\n\t=”. Please use alias to rename it [duplicate]

烂漫一生 submitted on 2019-12-31 01:55:11
Question: (This question already has answers here: Spark Dataframe validating column names for parquet writes (scala). Closed last year.) I'm trying to load Parquet data into PySpark, where a column has a space in the name:

df = spark.read.parquet('my_parquet_dump')
df.select(df['Foo Bar'].alias('foobar'))

Even though I have aliased the column, I'm still getting this error, with errors propagating from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load
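
The check that raises this error looks at the underlying Parquet field names, which is why an alias inside a select does not remove it. A commonly suggested direction is to rename the offending columns at the DataFrame level as early as possible, and certainly before writing Parquet again; depending on the Spark version the check may already fire while resolving the read, in which case the data has to be fixed at the source. A hedged Scala sketch, using the path from the question and the invalid-character list from the error message:

```scala
// Hypothetical sketch: replace every character Parquet rejects with an underscore,
// right after the read and before any further Parquet I/O. Path and column names
// follow the question.
val df = spark.read.parquet("my_parquet_dump")
val cleaned = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
}
cleaned.select("Foo_Bar").show()   // "Foo Bar" has become "Foo_Bar"
```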

Mapping json to case class with Spark (spaces in the field name)

ⅰ亾dé卋堺 submitted on 2019-12-30 11:02:39
Question: I am trying to read a JSON file with the Spark Dataset API; the problem is that this JSON contains spaces in some of the field names. This would be a JSON row:

{"Field Name" : "value"}

My case class needs to be like this:

case class MyType(`Field Name`: String)

Then I can load the file into a DataFrame and it will load the correct schema:

val dataframe = spark.read.json(path)

The problem comes when I try to convert the DataFrame to a Dataset[MyType]:

dataframe.as[MyType]

The StructSchema loaded
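
A common workaround is to keep the case class free of backticks and rename the DataFrame column to a legal identifier before calling .as[...]. A minimal sketch (e.g. in spark-shell, where spark is already defined); the input path and the new field name are assumptions:

```scala
// Hypothetical sketch (e.g. in spark-shell): rename the column to a legal identifier
// and map to a case class that does not need backticks.
case class MyType(fieldName: String)

import spark.implicits._

val ds = spark.read.json("path/to/rows.json")   // assumed path
  .withColumnRenamed("Field Name", "fieldName")
  .as[MyType]
ds.show()
```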

multiple criteria for aggregation on pySpark Dataframe

笑着哭i submitted on 2019-12-30 08:10:11
Question: I have a pySpark dataframe that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku, and then calculate the min and max dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date': 'max'}) \
    .limit(10) \
    .show()

the behavior is the same
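
The dict form of agg can hold only one entry per column name, so {'date': 'min', 'date': 'max'} collapses to a single aggregation; passing explicit aggregate expressions avoids that. A sketch in Scala, although the question uses PySpark (pyspark.sql.functions offers the same min/max helpers):

```scala
import org.apache.spark.sql.functions.{max, min}

// Hypothetical sketch: explicit aggregate expressions, one per aggregation,
// so both the min and the max of the same column survive.
val result = df_testing
  .groupBy("sku")
  .agg(min("date").as("min_date"), max("date").as("max_date"))
result.show(10)
```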

Group By, Rank and aggregate spark data frame using pyspark

会有一股神秘感。 submitted on 2019-12-30 02:11:32
Question: I have a dataframe that looks like:

A   B   C
---------------
A1  B1  0.8
A1  B2  0.55
A1  B3  0.43
A2  B1  0.7
A2  B2  0.5
A2  B3  0.5
A3  B1  0.2
A3  B2  0.3
A3  B3  0.4

How do I convert the column 'C' to the relative rank (higher score -> better rank) per column A? Expected output:

A   B   Rank
---------------
A1  B1  1
A1  B2  2
A1  B3  3
A2  B1  1
A2  B2  2
A2  B3  2
A3  B1  3
A3  B2  2
A3  B3  1

The ultimate state I want to reach is to aggregate column B and store the ranks for each A. Example:

B   Ranks
B1  [1,1,3]
B2  [2,2,2]
B3  [3,2
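
A window function handles the per-A ranking, and collect_list then gathers the ranks per B. A sketch in Scala, although the question uses PySpark; the dataframe name df is an assumption, and dense_rank is used so that ties share a rank as in the expected output:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, dense_rank}

// Hypothetical sketch: rank C within each A (highest score = rank 1),
// then collect the ranks per B.
val w = Window.partitionBy("A").orderBy(col("C").desc)

val ranked     = df.withColumn("Rank", dense_rank().over(w))
val aggregated = ranked.groupBy("B").agg(collect_list(col("Rank")).as("Ranks"))
aggregated.show(false)
```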

How to transform DataFrame before joining operation?

≯℡__Kan透↙ submitted on 2019-12-29 09:41:48
Question: The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively. But the code in Spark 2.2 works very slowly when df_products is around 800 MB:

df_products.createOrReplaceTempView("df_products")
val result = df.as("df2")
  .join(spark.sql("SELECT * FROM df_products")
  .select($"product_PK", explode($"products").as(
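
One way to simplify this is to drop the temp-view round trip, explode products once, keep only the columns the join needs, and extract the rank before joining; broadcasting the projected side can also help if it ends up small enough. A hedged Scala sketch; the join key is an assumption, since the original snippet is cut off before the join condition:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical sketch: explode once, keep only product_PK and the rank (the second
// number of each pair), then join. The join key product_PK is an assumption, since
// the question's snippet is truncated before the join condition.
val productRanks = df_products
  .select(col("product_PK"), explode(col("products")).as("pair"))
  .select(col("product_PK"), col("pair").getItem(1).as("rank"))

val result = df.join(productRanks, Seq("product_PK"))
```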