pyspark-sql

Join two data frames, select all columns from one and some columns from the other

Question: Let's say I have a Spark data frame df1 with several columns (among which the column 'id'), and a data frame df2 with two columns, 'id' and 'other'. Is there a way to replicate the following command

    sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

using only PySpark functions such as join(), select() and the like? I have to implement this join inside a function and I don't want to be forced to pass sqlContext in as a function parameter. Thanks!

Answer 1: Not sure if the
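A minimal sketch of the DataFrame-API equivalent, assuming both frames really do share the 'id' column described in the question:

    from pyspark.sql import functions as F

    result = (df1.alias("a")
                 .join(df2.alias("b"), F.col("a.id") == F.col("b.id"))
                 .select("a.*", "b.other"))

Aliasing both sides keeps the two 'id' columns distinguishable, so "a.*" picks up every column of df1 without ambiguity.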

Spark SQL broadcast hint intermediate tables

Question: I have a problem using broadcast hints (maybe it is some lack of SQL knowledge). I have a query like

    SELECT * /* broadcast(a) */ FROM a INNER JOIN b ON .... INNER JOIN c ON ....

and I would like to do

    SELECT * /* broadcast(a) */ FROM a INNER JOIN b ON .... INNER JOIN c /* broadcast(AjoinedwithB) */ ON ....

That is, I want to force a broadcast join (I would prefer to avoid changing Spark parameters to force it everywhere), but I don't know how to refer to the intermediate table named AjoinedwithB. Of course I can
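One way to hint an intermediate result without having to name it in SQL is to build that step with the DataFrame API and wrap it in broadcast(); a hedged sketch, where the table names come from the question but the join keys are invented for illustration:

    from pyspark.sql.functions import broadcast

    a = spark.table("a")
    b = spark.table("b")
    c = spark.table("c")

    # Hypothetical join keys -- the real ON conditions are elided in the question
    a_joined_with_b = broadcast(a).join(b, a["ab_key"] == b["ab_key"])

    # Wrapping the intermediate DataFrame in broadcast() marks it as the small,
    # broadcastable side of the next join
    result = broadcast(a_joined_with_b).join(c, a_joined_with_b["bc_key"] == c["bc_key"])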

What does rdd mean in pyspark dataframe

Question: I am new to PySpark and I am wondering what rdd means on a PySpark DataFrame.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

These two lines of code have the same output, so I am wondering what the effect of .rdd is:

    weatherData.collect()
    weatherData.rdd.collect()

Answer 1: A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case. So, a DataFrame has additional
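A short illustration of the difference (weather.csv as in the question): .rdd exposes the DataFrame's underlying RDD of Row objects, which is why collect() returns the same rows either way, but only the RDD accepts low-level transformations such as map().

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

    print(type(weatherData))      # <class 'pyspark.sql.dataframe.DataFrame'>
    print(type(weatherData.rdd))  # <class 'pyspark.rdd.RDD'>

    # Both calls return a list of Row objects ...
    weatherData.collect()
    weatherData.rdd.collect()

    # ... but only the RDD supports transformations like map()
    first_values = weatherData.rdd.map(lambda row: row[0]).collect()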

“'DataFrame' object has no attribute 'apply'” when trying to apply lambda to create new column

Question: I am trying to add a new column to a Pandas DataFrame, but I am facing a weird error. The new column is expected to be a transformation of an existing column, which can be done by a lookup in a dictionary/hashmap.

    # Loading data
    df = sqlContext.read.format(...).load(train_df_path)

    # Instantiating the map
    some_map = {
        'a': 0,
        'b': 1,
        'c': 1,
    }

    # Creating a new column using the map
    df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)

Which leads to the following
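The error arises because sqlContext.read returns a Spark DataFrame, which has no apply(); a hedged sketch of one Spark-side way to do the dictionary lookup (the column name some_column_name is taken from the question):

    from itertools import chain
    from pyspark.sql import functions as F

    some_map = {'a': 0, 'b': 1, 'c': 1}

    # Build a literal map column from the Python dict and index it with the source column
    mapping = F.create_map([F.lit(x) for x in chain(*some_map.items())])
    df = df.withColumn("new_column", mapping[F.col("some_column_name")])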

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

Question: This solution, in theory, works perfectly for what I need, which is to create a new copied version of a dataframe while excluding certain nested struct fields. Here is a minimally reproducible artifact of my issue:

    >>> df.printSchema()
    root
     |-- big: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- keep: string (nullable = true)
     |    |    |-- delete: string (nullable = true)

which you can instantiate like such:

    schema = StructType([StructField("big", ArrayType(StructType([
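A hedged sketch of the to_json / from_json round trip the title refers to: serialize the column, then parse it back with a schema that simply omits the unwanted field (field names taken from the printed schema above); fields absent from the parse schema are dropped.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    # Target schema without the "delete" field
    trimmed_schema = ArrayType(StructType([StructField("keep", StringType())]))

    df2 = df.withColumn("big", F.from_json(F.to_json("big"), trimmed_schema))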

Spark SQL - 1 task running for a long time due to null values in join key

Question: I am performing a left join between two tables with 1.3 billion records each. However, the join key is null in table1 for approximately 600 million records, and because of this all the null records get allocated to a single task, so data skew makes that one task run for hours.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("report").enableHiveSupport().getOrCreate()
    tbl1 = spark.sql("""select a.col1, b.col2, a.Col3
                        from table1 a
                        left join table2 b
                        on a.col1 = b.col2""")
    tbl1.write.mode(
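A common way to sidestep this kind of null-key skew is to join only the rows that can actually match and append the null-key rows afterwards; a rough sketch using the column names from the query (the null placeholder may need a cast to col2's real type):

    from pyspark.sql import functions as F

    a = spark.table("table1")
    b = spark.table("table2")

    # Join only rows whose key can actually match
    a_nn = a.where(F.col("col1").isNotNull())
    matched = (a_nn.join(b, a_nn["col1"] == b["col2"], "left")
                   .select(a_nn["col1"], b["col2"], a_nn["Col3"]))

    # Null keys can never match, so add them back with a null right-hand side
    unmatched = (a.where(F.col("col1").isNull())
                   .select(a["col1"], F.lit(None).alias("col2"), a["Col3"]))

    result = matched.unionByName(unmatched)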

Using LSH in spark to run nearest neighbors query on every point in dataframe

Question: I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSHModel from PySpark. Code for creating the model:

    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n)
    model = brp.fit(data_df)
    df_lsh = model.transform(data_df)

Now, how do I run an approximate nearest neighbor query for each point in data_df? I have tried broadcasting the model but got a pickle error. Also, defining a udf to access the model gives
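Rather than querying point by point, one option is an approximate self-join followed by a per-point ranking; a hedged sketch that assumes the rows carry an id column and picks arbitrary values for k and the distance threshold:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    k = 5            # hypothetical number of neighbours
    threshold = 2.0  # hypothetical maximum Euclidean distance

    # Candidate pairs whose distance is below the threshold
    pairs = model.approxSimilarityJoin(df_lsh, df_lsh, threshold, distCol="dist")
    pairs = pairs.where(F.col("datasetA.id") != F.col("datasetB.id"))  # drop self-matches

    # Keep the k closest neighbours for each point
    w = Window.partitionBy("datasetA.id").orderBy("dist")
    knn = pairs.withColumn("rank", F.row_number().over(w)).where(F.col("rank") <= k)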

PySpark sql compare records on each day and report the differences

Question: The problem I have is this dataset: [dataset screenshot omitted] It shows which businesses are doing business on specific days. What I want to achieve is to report which businesses were added on which day. Perhaps I'm looking for some answer like: [expected output screenshot omitted] I managed to tidy up all the records using this SQL:

    select [Date]
          ,Mnemonic
          ,securityDesc
          ,sum(cast(TradedVolume as money)) as TradedVolumSum
    FROM SomeTable
    group by [Date], Mnemonic, securityDesc

but I don't know how to compare each day's records with the other day and
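One way to get "which businesses were added on which day" is to compute each business's first appearance and group by that date; a minimal sketch using the column names from the SQL above, assuming the data sits in a DataFrame df:

    from pyspark.sql import functions as F

    # First date on which each business (Mnemonic) appears
    first_seen = df.groupBy("Mnemonic").agg(F.min("Date").alias("first_date"))

    # Businesses "added" on a day are those whose first appearance is that day
    added_per_day = (first_seen.groupBy("first_date")
                               .agg(F.collect_list("Mnemonic").alias("added_businesses"))
                               .orderBy("first_date"))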

Remove all rows that are duplicates with respect to some columns

Question: I've seen a couple of questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

    +------+-----+----+
    |    id|value|type|
    +------+-----+----+
    |283924|  1.5|   0|
    |283924|  1.5|   1|
    |982384|  3.0|   0|
    |982384|  3.0|   1|
    |892383|  2.0|   0|
    |892383|  2.5|   1|
    +------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances. In this case: rows 1 and 2 are duplicates (again, we are ignoring the "type" column), rows 3 and 4 are
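A sketch of one way to drop every row whose (id, value) pair occurs more than once, using the column names from the sample above:

    from pyspark.sql import functions as F

    # (id, value) pairs that occur more than once
    dupes = (df.groupBy("id", "value")
               .count()
               .where(F.col("count") > 1)
               .drop("count"))

    # left_anti keeps only rows whose (id, value) is NOT in dupes
    result = df.join(dupes, on=["id", "value"], how="left_anti")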

How to serialize PySpark GroupedData object?

Question: I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point (running aggregations on top of it as needed).

    df.groupBy("geo_city")
    <pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro format (as the conversion operation is expensive).
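For context, groupBy() is lazy: no shuffle happens until an aggregation is applied, so there is no materialised GroupedData to serialize. A hedged sketch of one workaround is to persist an aggregated DataFrame and re-read it later (the aggregation and output path here are purely illustrative):

    # Any aggregation turns the GroupedData back into a DataFrame that can be saved
    agg_df = df.groupBy("geo_city").count()
    agg_df.write.mode("overwrite").parquet("/tmp/geo_city_counts")  # hypothetical path

    # Later: read it back and continue from that point
    restored = spark.read.parquet("/tmp/geo_city_counts")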