pyspark-sql

How to sort values before concatenating text columns in PySpark

Submitted by 纵然是瞬间 on 2020-04-07 08:00:13

Question: I need help converting the code below to PySpark or PySpark SQL.

df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)

It basically adds a new column named full_name, which concatenates the values of the columns first and last in sorted order. I have written the code below, but I don't know how to sort a column's text value.

df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last')))

Answer 1: From Spark 2.4+ we can use array
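
The answer above is truncated; a minimal sketch of the Spark 2.4+ array-based approach it points to, reusing the column names from the question, could look like this:

import pyspark.sql.functions as f

# Put the two name columns into an array, sort it, then join the elements with "_"
df = df.withColumn(
    'full_name',
    f.array_join(f.array_sort(f.array(f.col('first'), f.col('last'))), '_')
)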

How to count unique IDs after groupBy in PySpark

Submitted by て烟熏妆下的殇ゞ on 2020-04-05 15:41:49

Question: I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is: [students by year][1]. The problem I discovered is that many IDs are repeated, so the result is wrong and inflated. I want to aggregate the students by year, count the total
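
The question is truncated here, but counting each student only once per year typically means using countDistinct instead of count; a hedged sketch reusing the names from the snippet above:

import pyspark.sql.functions as fn
from pyspark.sql.functions import col

# countDistinct ignores repeated Student_ID values within each year
df_grouped = (
    Df2.groupby('Year')
       .agg(fn.countDistinct(col('Student_ID')).alias('total_student_by_year'))
)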

Fill in missing values based on series and populate second row based on previous or next row in PySpark

Submitted by 萝らか妹 on 2020-03-25 17:50:14

Question: I have a CSV with 4 columns. The file is missing some rows of the series.

Input:
No  A   B   C
1   10  50  12
3   40  50  12
4   20  60  15
6   80  80  18

Output:
No  A   B   C
1   10  50  12
2   10  50  12
3   40  50  12
4   20  60  15
5   20  60  15
6   80  80  18

I need PySpark code to generate the above output.

Source: https://stackoverflow.com/questions/60681807/fill-in-missing-values-based-on-series-and-populate-second-row-based-on-previous
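
A minimal sketch of one way to do this, assuming the DataFrame is called df, No is an integer key, and missing rows should copy the previous row (as in the expected output):

from pyspark.sql import functions as F, Window

# Build the complete range of sequence numbers between the smallest and largest No
bounds = df.agg(F.min('No').alias('lo'), F.max('No').alias('hi')).first()
full = spark.range(bounds['lo'], bounds['hi'] + 1).withColumnRenamed('id', 'No')

# Left-join the existing rows onto the full range, then forward-fill the gaps
w = Window.orderBy('No').rowsBetween(Window.unboundedPreceding, 0)
result = full.join(df, on='No', how='left')
for c in ['A', 'B', 'C']:
    result = result.withColumn(c, F.last(c, ignorenulls=True).over(w))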

Compare two datasets in pyspark

Submitted by ぐ巨炮叔叔 on 2020-03-04 15:34:23

Question: I have 2 datasets.

Example dataset 1:

id   | model | first_name | last_name
1234 | 32    | 456765     | [456700,987565]
4539 | 20    | 123211     | [893456,123456]

Sometimes one of the columns first_name and last_name is empty.

Example dataset 2:

number | matricule | name | model
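
The question is cut off before the comparison criteria are given, so the following is only a hedged sketch of one common pattern, assuming the two DataFrames are called ds1 and ds2 and that a row of dataset 2 "matches" when its name equals first_name or appears in the last_name array of dataset 1 for the same model:

from pyspark.sql import functions as F

# Hypothetical comparison: join on model and flag whether ds2.name is found in ds1
matched = (
    ds2.join(ds1, ds2['model'] == ds1['model'], 'left')
       .withColumn(
           'name_found',
           (ds2['name'] == ds1['first_name']) | F.array_contains(ds1['last_name'], ds2['name'])
       )
)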

How to convert multiple parquet files into TFrecord files using SPARK?

Submitted by 不羁岁月 on 2020-02-28 17:24:08

Question: I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. So I have found no other way than to try to work in two steps: repartition the DataFrame according to my condition using partitionBy() and write the resulting partitions to parquet files, then read those parquet files to
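
The question is truncated, but a hedged sketch of the two-step workaround it describes might look like the following (the partition column name and paths are made up here, and the "tfrecords" format assumes the spark-tensorflow-connector is on the classpath):

# Step 1: write the DataFrame partitioned by the stratification condition
df.write.partitionBy('condition').parquet('/tmp/stratified_parquet')

# Step 2: read a partition back and write it out as TFRecord files
part_df = spark.read.parquet('/tmp/stratified_parquet/condition=A')
(part_df.write
        .format('tfrecords')
        .option('recordType', 'Example')
        .save('/tmp/stratified_tfrecords/condition=A'))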

How to extract a single (column/row) value from a dataframe using PySpark?

Submitted by 不羁的心 on 2020-02-25 22:43:31

Question: Here's my Spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the DataFrame, not extract values from it. It sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!

df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df
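
The snippet is cut off, but extracting a single scalar usually comes down to first() or collect(); a minimal sketch, assuming the 2517 comes from an aggregation such as a max over a (hypothetical) temperature column:

# first() returns a Row; index into it to get a plain Python value
row = df.agg({"AT": "max"}).first()  # "AT" is an assumed column name
value = row[0]
print("{} degrees".format(value))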

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45

Question: I want to compute a bunch of different aggregate functions on different columns in a DataFrame. I know I can do something like this, but the output is all in one row:

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output DataFrame to look like this:

     | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...
cola | 1    | 10   | ...
colb | 2    | NULL | ...
colc | 5    | 20   | ...
cold | NULL | 42   | ...

Where my
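
The question is truncated here; one hedged sketch for getting one summary row per column is to aggregate each column separately and union the results (assuming the aggregated columns have compatible types):

from functools import reduce
import pyspark.sql.functions as F

agg_cols = ['cola', 'colb', 'colc', 'cold']

# One single-row DataFrame per column: | column | Min | Max |
summaries = [
    df.agg(F.min(c).alias('Min'), F.max(c).alias('Max'))
      .withColumn('column', F.lit(c))
      .select('column', 'Min', 'Max')
    for c in agg_cols
]
result = reduce(lambda a, b: a.unionByName(b), summaries)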

How to generate hourly timestamps between two dates in PySpark?

Submitted by 一世执手 on 2020-02-25 04:14:09

Question: Consider this sample DataFrame:

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+

I would like to explode those two dates into an hourly time series like

+-------------------+-------------------+
|            minDate|            maxDate|
+--
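
The expected output is cut off, but on Spark 2.4+ one hedged sketch is to build the hourly series with sequence(...) and explode it:

import pyspark.sql.functions as F

# One row per hour between minDate and maxDate (inclusive of the start)
hours = df.withColumn(
    'hour',
    F.explode(F.expr("sequence(minDate, maxDate, interval 1 hour)"))
)
hours.select('minDate', 'maxDate', 'hour').show(truncate=False)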