How do I reduce a Spark dataframe to a maximum number of rows for each value in a column?

Submitted by 天大地大妈咪最大 on 2020-01-23 19:39:29

Question


I need to reduce a DataFrame and export it to Parquet. I need to make sure that I have, for example, at most 10,000 rows for each value in a column.

The dataframe I am working with looks like the following:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|             AVALON|
|         FORD|             RANGER|
|MERCEDES-BENZ|            C-CLASS|
|       TOYOTA|             TUNDRA|
|         FORD|EXPLORER SPORT TRAC|
|    CHEVROLET|           COLORADO|
|   MITSUBISHI|            MONTERO|
|        DODGE|      GRAND CARAVAN|
+-------------+-------------------+

I need to return at most 10,000 rows for each model; the current counts per model look like this:

+--------------------+-------+
|               Model|  count|
+--------------------+-------+
|                 MDX|1658647|
|               ASTRO| 682657|
|           ENTOURAGE|  72622|
|             ES 300H|  80712|
|            6 SERIES| 145252|
|           GRAN FURY|   9719|
|RANGE ROVER EVOQU...|   4290|
|        LEGACY WAGON|   2070|
|        LEGACY SEDAN|    104|
|  DAKOTA CHASSIS CAB|      8|
|              CAMARO|2028678|
|                  XT|  10009|
|             DYNASTY| 171776|
|                 944|  43044|
|         F430 SPIDER|    506|
|FLEETWOOD SEVENTY...|      6|
|         MONTE CARLO|1040806|
|             LIBERTY|2415456|
|            ESCALADE| 798832|
| SIERRA 3500 CLASSIC|   9541|
+--------------------+-------+

This question is not the same as the one others have suggested below; that one only retrieves the groups whose count is greater than some value. What I want, in pseudo-code, is: for each value in df['Model'], limit the rows for that value (model) to 10,000. In other words, if a model has more than 10,000 rows, get rid of the excess; otherwise leave all of its rows.
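
To make that pseudo-code concrete, here is a minimal plain-Python sketch of the intent (rows_by_model is a hypothetical in-memory dict mapping each model to its list of rows; it is not part of the Spark API):

# Purely illustrative: cap each model at 10,000 rows in plain Python.
# rows_by_model is a hypothetical dict {model: [row, row, ...]}.
capped_rows = []
for model, rows in rows_by_model.items():
    capped_rows.extend(rows[:10000])  # slicing keeps all rows when there are fewer than 10,000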


Answer 1:


You can use row_number over a window with partitionBy and orderBy, and then filter against your limit. For example, the following shuffles each partition randomly and limits the sample to at most 10,000 rows per value:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each Model partition in random order,
# then keep only the first 10,000 of each.
window = Window.partitionBy(df['Model']).orderBy(F.rand())
df = df.select(F.col('*'),
               F.row_number().over(window).alias('row_number')) \
       .where(F.col('row_number') <= 10000)
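
Since the end goal is a Parquet export, one plausible follow-up (the output path here is illustrative, not from the original answer) is to drop the helper column and write the result:

# Drop the helper column before exporting; the path is illustrative.
df.drop('row_number').write.parquet('capped_models.parquet')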



Answer 2:


If I understand your question correctly, you want to sample some rows (e.g. 10,000), but only for records whose count is greater than 10,000. If so, this is the answer:

from pyspark.sql.functions import count, lit

# Count the rows per (Make, Model) and keep only models with more than 10,000 rows.
df = df.groupBy('Make', 'Model').agg(count(lit(1)).alias('count'))
df = df.filter(df['count'] > 10000).select('Model', 'count')
df.write.parquet('output.parquet')



Answer 3:


Simply do

import pyspark.sql.functions as F

df = df.groupBy("Model").agg(F.count(F.lit(1)).alias("Count"))
df = df.filter(df["Count"] < 10000).select("Model", "Count")

df.write.parquet("data.parquet")



Answer 4:


I will modify the given problem slightly so that it can be visualized here, by reducing the maximum number of rows for each distinct value to 2 (instead of 10,000).

Sample dataframe:

df = spark.createDataFrame(
    [('PONTIAC', 'GRAND AM'), ('BUICK', 'CENTURY'), ('LEXUS', 'IS 300'),
     ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'), ('TOYOTA', 'PRIUS'),
     ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'),
     ('JEEP', 'WRANGLER'), ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'),
     ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'),
     ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'), ('CHEVROLET', 'SILVERADO 1500'),
     ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'), ('MERCEDES-BENZ', 'C-CLASS'),
     ('TOYOTA', 'TUNDRA'), ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'),
     ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'),
     ('CHEVROLET', 'SILVERADO 1500'), ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'),
     ('MERCEDES-BENZ', 'C-CLASS'), ('TOYOTA', 'TUNDRA'), ('FORD', 'EXPLORER SPORT TRAC'),
     ('CHEVROLET', 'COLORADO'), ('MITSUBISHI', 'MONTERO'), ('DODGE', 'GRAND CARAVAN')],
    ['Make', 'Model']
)

Let's do a row count:

df.groupby('Model').count().show()

+-------------------+-----+
|              Model|count|
+-------------------+-----+
|             AVALON|    2|
|            CENTURY|    1|
|             TUNDRA|    2|
|           WRANGLER|    3|
|           GRAND AM|    3|
|EXPLORER SPORT TRAC|    1|
|            C-CLASS|    2|
|      MONTERO SPORT|    3|
|              CAMRY|    3|
|      GRAND CARAVAN|    1|
|     SILVERADO 1500|    2|
|              PRIUS|    3|
|            MONTERO|    1|
|           COLORADO|    1|
|             RANGER|    2|
|          SLK-CLASS|    3|
|           SL-CLASS|    2|
|             IS 300|    1|
+-------------------+-----+

If I understand your question correctly, you can assign a row number to each row, using a window partitioned by Model:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

win_1 = Window.partitionBy('Model').orderBy(desc('Make'))
df = df.withColumn('row_num', row_number().over(win_1))

Then filter down to the rows where row_num <= 2:

df = df.filter(df.row_num <= 2).select('Make', 'Model')

There should be a total of 2+1+2+2+2+1+2+2+2+1+2+2+1+1+2+2+2+1 = 30 rows (each model's count from above, capped at 2).
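
As a quick sanity check, counting the filtered DataFrame should match the arithmetic above:

df.count()  # returns 30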

Final results:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|       TOYOTA|             AVALON|
|       TOYOTA|             AVALON|
|        BUICK|            CENTURY|
|       TOYOTA|             TUNDRA|
|       TOYOTA|             TUNDRA|
|         JEEP|           WRANGLER|
|         JEEP|           WRANGLER|
|      PONTIAC|           GRAND AM|
|      PONTIAC|           GRAND AM|
|         FORD|EXPLORER SPORT TRAC|
|MERCEDES-BENZ|            C-CLASS|
|MERCEDES-BENZ|            C-CLASS|
|   MITSUBISHI|      MONTERO SPORT|
|   MITSUBISHI|      MONTERO SPORT|
|       TOYOTA|              CAMRY|
|       TOYOTA|              CAMRY|
|        DODGE|      GRAND CARAVAN|
|    CHEVROLET|     SILVERADO 1500|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|              PRIUS|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|            MONTERO|
|    CHEVROLET|           COLORADO|
|         FORD|             RANGER|
|         FORD|             RANGER|
|MERCEDES-BENZ|          SLK-CLASS|
|MERCEDES-BENZ|          SLK-CLASS|
|MERCEDES-BENZ|           SL-CLASS|
|MERCEDES-BENZ|           SL-CLASS|
|        LEXUS|             IS 300|
+-------------+-------------------+


Source: https://stackoverflow.com/questions/59673909/how-do-i-reduce-a-spark-dataframe-to-a-maximum-amount-of-rows-for-each-value-in
