pyspark-sql

How to sort values before concatenating text columns in PySpark

Submitted by 纵然是瞬间 on 2020-04-07 08:00:13

Question: I need help converting the code below to PySpark or PySpark SQL.

df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)

It basically adds a new column named full_name, which concatenates the values of the columns first and last in sorted order. I have written the code below, but I don't know how to sort a column's text value.

df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last')))

Answer 1: From Spark 2.4+ we can use array
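
The answer above is truncated; a minimal sketch of the Spark 2.4+ array-based approach it points to, reusing the column names from the question, could look like this:

import pyspark.sql.functions as f

# Put the two name columns into an array, sort it, then join the elements with "_"
df = df.withColumn(
    'full_name',
    f.array_join(f.array_sort(f.array(f.col('first'), f.col('last'))), '_')
)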

How to count unique IDs after groupBy in PySpark

Submitted by て烟熏妆下的殇ゞ on 2020-04-05 15:41:49

Question: I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is: [students by year][1]. The problem I discovered is that many IDs are repeated, so the result is wrong and inflated. I want to aggregate the students by year, count the total
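
The question is truncated here, but counting each student only once per year typically means using countDistinct instead of count; a hedged sketch reusing the names from the snippet above:

import pyspark.sql.functions as fn
from pyspark.sql.functions import col

# countDistinct ignores repeated Student_ID values within each year
df_grouped = (
    Df2.groupby('Year')
       .agg(fn.countDistinct(col('Student_ID')).alias('total_student_by_year'))
)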

Fill in missing values based on series and populate second row based on previous or next row in PySpark

Submitted by 萝らか妹 on 2020-03-25 17:50:14

Question: I have a CSV with 4 columns. The file is missing some rows of the series.

Input:
No  A   B   C
1   10  50  12
3   40  50  12
4   20  60  15
6   80  80  18

Output:
No  A   B   C
1   10  50  12
2   10  50  12
3   40  50  12
4   20  60  15
5   20  60  15
6   80  80  18

I need PySpark code to generate the above output.

Source: https://stackoverflow.com/questions/60681807/fill-in-missing-values-based-on-series-and-populate-second-row-based-on-previous
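
A minimal sketch of one way to do this, assuming the DataFrame is called df, No is an integer key, and missing rows should copy the previous row (as in the expected output):

from pyspark.sql import functions as F, Window

# Build the complete range of sequence numbers between the smallest and largest No
bounds = df.agg(F.min('No').alias('lo'), F.max('No').alias('hi')).first()
full = spark.range(bounds['lo'], bounds['hi'] + 1).withColumnRenamed('id', 'No')

# Left-join the existing rows onto the full range, then forward-fill the gaps
w = Window.orderBy('No').rowsBetween(Window.unboundedPreceding, 0)
result = full.join(df, on='No', how='left')
for c in ['A', 'B', 'C']:
    result = result.withColumn(c, F.last(c, ignorenulls=True).over(w))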

Compare two datasets in pyspark

Submitted by ぐ巨炮叔叔 on 2020-03-04 15:34:23

Question: I have 2 datasets.

Example dataset 1:

id   | model | first_name | last_name
1234 | 32    | 456765     | [456700,987565]
4539 | 20    | 123211     | [893456,123456]

Sometimes one of the columns first_name and last_name is empty.

Example dataset 2:

number | matricule | name | model
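
The question is cut off before the comparison criteria are given, so the following is only a hedged sketch of one common pattern, assuming the two DataFrames are called ds1 and ds2 and that a row of dataset 2 "matches" when its name equals first_name or appears in the last_name array of dataset 1 for the same model:

from pyspark.sql import functions as F

# Hypothetical comparison: join on model and flag whether ds2.name is found in ds1
matched = (
    ds2.join(ds1, ds2['model'] == ds1['model'], 'left')
       .withColumn(
           'name_found',
           (ds2['name'] == ds1['first_name']) | F.array_contains(ds1['last_name'], ds2['name'])
       )
)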

How to convert multiple parquet files into TFrecord files using SPARK?

Submitted by 不羁岁月 on 2020-02-28 17:24:08

Question: I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. So I have found no other way than to try to work in two steps: repartition the DataFrame according to my condition using partitionBy() and write the resulting partitions to parquet files, then read those parquet files to
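
The question is truncated, but a hedged sketch of the two-step workaround it describes might look like the following (the partition column name and paths are made up here, and the "tfrecords" format assumes the spark-tensorflow-connector is on the classpath):

# Step 1: write the DataFrame partitioned by the stratification condition
df.write.partitionBy('condition').parquet('/tmp/stratified_parquet')

# Step 2: read a partition back and write it out as TFRecord files
part_df = spark.read.parquet('/tmp/stratified_parquet/condition=A')
(part_df.write
        .format('tfrecords')
        .option('recordType', 'Example')
        .save('/tmp/stratified_tfrecords/condition=A'))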

How to extract a single (column/row) value from a dataframe using PySpark?

Submitted by 不羁的心 on 2020-02-25 22:43:31

Question: Here's my Spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the DataFrame, not extract values from it. It sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!

df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df
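
The snippet is cut off, but extracting a single scalar usually comes down to first() or collect(); a minimal sketch, assuming the 2517 comes from an aggregation such as a max over a (hypothetical) temperature column:

# first() returns a Row; index into it to get a plain Python value
row = df.agg({"AT": "max"}).first()  # "AT" is an assumed column name
value = row[0]
print("{} degrees".format(value))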

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45

Question: I want to compute a bunch of different aggregate functions on different columns in a DataFrame. I know I can do something like this, but the output is all in one row:

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output DataFrame to look like this:

     | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...
cola | 1    | 10   | ...
colb | 2    | NULL | ...
colc | 5    | 20   | ...
cold | NULL | 42   | ...

Where my
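
The question is truncated here; one hedged sketch for getting one summary row per column is to aggregate each column separately and union the results (assuming the aggregated columns have compatible types):

from functools import reduce
import pyspark.sql.functions as F

agg_cols = ['cola', 'colb', 'colc', 'cold']

# One single-row DataFrame per column: | column | Min | Max |
summaries = [
    df.agg(F.min(c).alias('Min'), F.max(c).alias('Max'))
      .withColumn('column', F.lit(c))
      .select('column', 'Min', 'Max')
    for c in agg_cols
]
result = reduce(lambda a, b: a.unionByName(b), summaries)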

How to generate hourly timestamps between two dates in PySpark?

Submitted by 一世执手 on 2020-02-25 04:14:09

Question: Consider this sample DataFrame:

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+

I would like to explode those two dates into an hourly time series like

+-------------------+-------------------+
|            minDate|            maxDate|
+--
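
The expected output is cut off, but on Spark 2.4+ one hedged sketch is to build the hourly series with sequence(...) and explode it:

import pyspark.sql.functions as F

# One row per hour between minDate and maxDate (inclusive of the start)
hours = df.withColumn(
    'hour',
    F.explode(F.expr("sequence(minDate, maxDate, interval 1 hour)"))
)
hours.select('minDate', 'maxDate', 'hour').show(truncate=False)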