pyspark-sql

Join two data frames, select all columns from one and some columns from the other

Question: Let's say I have a Spark data frame df1 with several columns (among which the column 'id'), and a data frame df2 with two columns, 'id' and 'other'. Is there a way to replicate the following command

    sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

using only PySpark functions such as join(), select() and the like? I have to implement this join inside a function and I don't want to be forced to pass sqlContext in as a function parameter. Thanks!

Answer 1: Not sure if the
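A minimal sketch of the DataFrame-API equivalent, assuming both frames really do share the 'id' column described in the question:

    from pyspark.sql import functions as F

    result = (df1.alias("a")
                 .join(df2.alias("b"), F.col("a.id") == F.col("b.id"))
                 .select("a.*", "b.other"))

Aliasing both sides keeps the two 'id' columns distinguishable, so "a.*" picks up every column of df1 without ambiguity.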

Spark SQL broadcast hint intermediate tables

Question: I have a problem using broadcast hints (maybe it is some lack of SQL knowledge). I have a query like

    SELECT * /* broadcast(a) */ FROM a INNER JOIN b ON .... INNER JOIN c ON ....

and I would like to do

    SELECT * /* broadcast(a) */ FROM a INNER JOIN b ON .... INNER JOIN c /* broadcast(AjoinedwithB) */ ON ....

That is, I want to force a broadcast join (I would prefer to avoid changing Spark parameters to force it everywhere), but I don't know how to refer to the intermediate table named AjoinedwithB. Of course I can
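One way to hint an intermediate result without having to name it in SQL is to build that step with the DataFrame API and wrap it in broadcast(); a hedged sketch, where the table names come from the question but the join keys are invented for illustration:

    from pyspark.sql.functions import broadcast

    a = spark.table("a")
    b = spark.table("b")
    c = spark.table("c")

    # Hypothetical join keys -- the real ON conditions are elided in the question
    a_joined_with_b = broadcast(a).join(b, a["ab_key"] == b["ab_key"])

    # Wrapping the intermediate DataFrame in broadcast() marks it as the small,
    # broadcastable side of the next join
    result = broadcast(a_joined_with_b).join(c, a_joined_with_b["bc_key"] == c["bc_key"])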

What does rdd mean in pyspark dataframe

Question: I am new to PySpark and I am wondering what rdd means on a PySpark DataFrame.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

These two lines of code have the same output, so I am wondering what the effect of .rdd is:

    weatherData.collect()
    weatherData.rdd.collect()

Answer 1: A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case. So, a DataFrame has additional
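A short illustration of the difference (weather.csv as in the question): .rdd exposes the DataFrame's underlying RDD of Row objects, which is why collect() returns the same rows either way, but only the RDD accepts low-level transformations such as map().

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

    print(type(weatherData))      # <class 'pyspark.sql.dataframe.DataFrame'>
    print(type(weatherData.rdd))  # <class 'pyspark.rdd.RDD'>

    # Both calls return a list of Row objects ...
    weatherData.collect()
    weatherData.rdd.collect()

    # ... but only the RDD supports transformations like map()
    first_values = weatherData.rdd.map(lambda row: row[0]).collect()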

“'DataFrame' object has no attribute 'apply'” when trying to apply lambda to create new column

Question: I am trying to add a new column to a Pandas DataFrame, but I am facing a weird error. The new column is expected to be a transformation of an existing column, which can be done by a lookup in a dictionary/hashmap.

    # Loading data
    df = sqlContext.read.format(...).load(train_df_path)

    # Instantiating the map
    some_map = {
        'a': 0,
        'b': 1,
        'c': 1,
    }

    # Creating a new column using the map
    df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)

Which leads to the following
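The error arises because sqlContext.read returns a Spark DataFrame, which has no apply(); a hedged sketch of one Spark-side way to do the dictionary lookup (the column name some_column_name is taken from the question):

    from itertools import chain
    from pyspark.sql import functions as F

    some_map = {'a': 0, 'b': 1, 'c': 1}

    # Build a literal map column from the Python dict and index it with the source column
    mapping = F.create_map([F.lit(x) for x in chain(*some_map.items())])
    df = df.withColumn("new_column", mapping[F.col("some_column_name")])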

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

Question: This solution, in theory, works perfectly for what I need, which is to create a new copied version of a dataframe while excluding certain nested struct fields. Here is a minimally reproducible artifact of my issue:

    >>> df.printSchema()
    root
     |-- big: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- keep: string (nullable = true)
     |    |    |-- delete: string (nullable = true)

which you can instantiate like such:

    schema = StructType([StructField("big", ArrayType(StructType([
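A hedged sketch of the to_json / from_json round trip the title refers to: serialize the column, then parse it back with a schema that simply omits the unwanted field (field names taken from the printed schema above); fields absent from the parse schema are dropped.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    # Target schema without the "delete" field
    trimmed_schema = ArrayType(StructType([StructField("keep", StringType())]))

    df2 = df.withColumn("big", F.from_json(F.to_json("big"), trimmed_schema))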

Spark SQL - 1 task running for a long time due to null values in join key

Question: I am performing a left join between two tables with 1.3 billion records each. However, the join key is null in table1 for approximately 600 million records, and because of this all the null records get allocated to a single task, so data skew makes that one task run for hours.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("report").enableHiveSupport().getOrCreate()
    tbl1 = spark.sql("""select a.col1, b.col2, a.Col3
                        from table1 a
                        left join table2 b
                        on a.col1 = b.col2""")
    tbl1.write.mode(
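A common way to sidestep this kind of null-key skew is to join only the rows that can actually match and append the null-key rows afterwards; a rough sketch using the column names from the query (the null placeholder may need a cast to col2's real type):

    from pyspark.sql import functions as F

    a = spark.table("table1")
    b = spark.table("table2")

    # Join only rows whose key can actually match
    a_nn = a.where(F.col("col1").isNotNull())
    matched = (a_nn.join(b, a_nn["col1"] == b["col2"], "left")
                   .select(a_nn["col1"], b["col2"], a_nn["Col3"]))

    # Null keys can never match, so add them back with a null right-hand side
    unmatched = (a.where(F.col("col1").isNull())
                   .select(a["col1"], F.lit(None).alias("col2"), a["Col3"]))

    result = matched.unionByName(unmatched)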

Using LSH in spark to run nearest neighbors query on every point in dataframe

Question: I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSHModel from PySpark. Code for creating the model:

    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n)
    model = brp.fit(data_df)
    df_lsh = model.transform(data_df)

Now, how do I run an approximate nearest neighbor query for each point in data_df? I have tried broadcasting the model but got a pickle error. Also, defining a udf to access the model gives
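Rather than querying point by point, one option is an approximate self-join followed by a per-point ranking; a hedged sketch that assumes the rows carry an id column and picks arbitrary values for k and the distance threshold:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    k = 5            # hypothetical number of neighbours
    threshold = 2.0  # hypothetical maximum Euclidean distance

    # Candidate pairs whose distance is below the threshold
    pairs = model.approxSimilarityJoin(df_lsh, df_lsh, threshold, distCol="dist")
    pairs = pairs.where(F.col("datasetA.id") != F.col("datasetB.id"))  # drop self-matches

    # Keep the k closest neighbours for each point
    w = Window.partitionBy("datasetA.id").orderBy("dist")
    knn = pairs.withColumn("rank", F.row_number().over(w)).where(F.col("rank") <= k)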

PySpark sql compare records on each day and report the differences

Question: The problem I have is this dataset: [dataset screenshot omitted] It shows which businesses are doing business on specific days. What I want to achieve is to report which businesses were added on which day. Perhaps I'm looking for some answer like: [expected output screenshot omitted] I managed to tidy up all the records using this SQL:

    select [Date]
          ,Mnemonic
          ,securityDesc
          ,sum(cast(TradedVolume as money)) as TradedVolumSum
    FROM SomeTable
    group by [Date], Mnemonic, securityDesc

but I don't know how to compare each day's records with the other day and
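One way to get "which businesses were added on which day" is to compute each business's first appearance and group by that date; a minimal sketch using the column names from the SQL above, assuming the data sits in a DataFrame df:

    from pyspark.sql import functions as F

    # First date on which each business (Mnemonic) appears
    first_seen = df.groupBy("Mnemonic").agg(F.min("Date").alias("first_date"))

    # Businesses "added" on a day are those whose first appearance is that day
    added_per_day = (first_seen.groupBy("first_date")
                               .agg(F.collect_list("Mnemonic").alias("added_businesses"))
                               .orderBy("first_date"))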

Remove all rows that are duplicates with respect to some columns

Question: I've seen a couple of questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

    +------+-----+----+
    |    id|value|type|
    +------+-----+----+
    |283924|  1.5|   0|
    |283924|  1.5|   1|
    |982384|  3.0|   0|
    |982384|  3.0|   1|
    |892383|  2.0|   0|
    |892383|  2.5|   1|
    +------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances. In this case: rows 1 and 2 are duplicates (again, we are ignoring the "type" column), rows 3 and 4 are
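A sketch of one way to drop every row whose (id, value) pair occurs more than once, using the column names from the sample above:

    from pyspark.sql import functions as F

    # (id, value) pairs that occur more than once
    dupes = (df.groupBy("id", "value")
               .count()
               .where(F.col("count") > 1)
               .drop("count"))

    # left_anti keeps only rows whose (id, value) is NOT in dupes
    result = df.join(dupes, on=["id", "value"], how="left_anti")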

How to serialize PySpark GroupedData object?

Question: I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point (running aggregations on top of it as needed).

    df.groupBy("geo_city")
    <pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro format (as the conversion operation is expensive).
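For context, groupBy() is lazy: no shuffle happens until an aggregation is applied, so there is no materialised GroupedData to serialize. A hedged sketch of one workaround is to persist an aggregated DataFrame and re-read it later (the aggregation and output path here are purely illustrative):

    # Any aggregation turns the GroupedData back into a DataFrame that can be saved
    agg_df = df.groupBy("geo_city").count()
    agg_df.write.mode("overwrite").parquet("/tmp/geo_city_counts")  # hypothetical path

    # Later: read it back and continue from that point
    restored = spark.read.parquet("/tmp/geo_city_counts")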