Question:
I need to reduce a dataframe and export it to a Parquet file. I need to make sure that I have, e.g., at most 10,000 rows for each value in a column.
The dataframe I am working with looks like the following:
+-------------+-------------------+
| Make| Model|
+-------------+-------------------+
| PONTIAC| GRAND AM|
| BUICK| CENTURY|
| LEXUS| IS 300|
|MERCEDES-BENZ| SL-CLASS|
| PONTIAC| GRAND AM|
| TOYOTA| PRIUS|
| MITSUBISHI| MONTERO SPORT|
|MERCEDES-BENZ| SLK-CLASS|
| TOYOTA| CAMRY|
| JEEP| WRANGLER|
| CHEVROLET| SILVERADO 1500|
| TOYOTA| AVALON|
| FORD| RANGER|
|MERCEDES-BENZ| C-CLASS|
| TOYOTA| TUNDRA|
| FORD|EXPLORER SPORT TRAC|
| CHEVROLET| COLORADO|
| MITSUBISHI| MONTERO|
| DODGE| GRAND CARAVAN|
+-------------+-------------------+
I need to return at most 10,000 rows for each model:
+--------------------+-------+
| Model| count|
+--------------------+-------+
| MDX|1658647|
| ASTRO| 682657|
| ENTOURAGE| 72622|
| ES 300H| 80712|
| 6 SERIES| 145252|
| GRAN FURY| 9719|
|RANGE ROVER EVOQU...| 4290|
| LEGACY WAGON| 2070|
| LEGACY SEDAN| 104|
| DAKOTA CHASSIS CAB| 8|
| CAMARO|2028678|
| XT| 10009|
| DYNASTY| 171776|
| 944| 43044|
| F430 SPIDER| 506|
|FLEETWOOD SEVENTY...| 6|
| MONTE CARLO|1040806|
| LIBERTY|2415456|
| ESCALADE| 798832|
| SIERRA 3500 CLASSIC| 9541|
+--------------------+-------+
This question is not the same as the suggested duplicate because, as others have pointed out below, that one only retrieves rows where a count is greater than some value. I want: for each value in df['Model']: limit rows for that value (model) to 10,000 if there are 10,000 or more rows
(pseudo-code, obviously). In other words, if a model has more than 10,000 rows, drop the extras; otherwise keep all of its rows.
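For intuition only, a minimal pandas sketch of the intended semantics, assuming the data fit in memory (the toy values are made up; the Spark answers below scale out):
import pandas as pd

# Toy data for illustration; head(n) keeps at most n rows per group.
pdf = pd.DataFrame({'Make':  ['PONTIAC', 'BUICK', 'PONTIAC'],
                    'Model': ['GRAND AM', 'CENTURY', 'GRAND AM']})
reduced = pdf.groupby('Model').head(10000)  # small groups are kept whole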
Answer 1:
I suggest using row_number with a window, partitionBy, and orderBy to number the rows, and then filtering on that number with your limit. For example, the following takes a random shuffle and limits the sample to at most 10,000 rows per value:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each Model's rows in a random order, then keep the first 10,000.
window = Window.partitionBy(df['Model']).orderBy(F.rand())
df = df.select(F.col('*'),
               F.row_number().over(window).alias('row_number')) \
       .where(F.col('row_number') <= 10000)
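To finish the export step from the question, one might then drop the helper column and write the result to Parquet; the output path here is a hypothetical example:
# Drop the bookkeeping column before writing; the path is an example.
df.drop('row_number').write.parquet('reduced_models.parquet')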
Answer 2:
If I understand your question, you want to sample a few rows (e.g. 10,000), but these records should have a count greater than 10,000. If so, this is the answer:
from pyspark.sql.functions import count, lit

df = df.groupBy('Make', 'Model').agg(count(lit(1)).alias('count'))
df = df.filter(df['count'] > 10000).select('Model', 'count')
df.write.parquet('output.parquet')
Answer 3:
Simply do
import pyspark.sql.functions as F
df = df.groupBy("Model").agg(F.count(F.lit(1)).alias("Count"))
df = df.filter(df["Count"] < 10000).select("Model", "Count")
df.write.parquet("data.parquet")
Answer 4:
I will modify the given problem slightly so that it can be visualized here, by reducing the maximum number of rows for each distinct value to 2 rows (instead of 10,000).
Sample dataframe:
df = spark.createDataFrame(
    [('PONTIAC', 'GRAND AM'), ('BUICK', 'CENTURY'), ('LEXUS', 'IS 300'),
     ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'), ('TOYOTA', 'PRIUS'),
     ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'),
     ('JEEP', 'WRANGLER'), ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'),
     ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'),
     ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'), ('CHEVROLET', 'SILVERADO 1500'),
     ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'), ('MERCEDES-BENZ', 'C-CLASS'),
     ('TOYOTA', 'TUNDRA'), ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'),
     ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'),
     ('CHEVROLET', 'SILVERADO 1500'), ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'),
     ('MERCEDES-BENZ', 'C-CLASS'), ('TOYOTA', 'TUNDRA'), ('FORD', 'EXPLORER SPORT TRAC'),
     ('CHEVROLET', 'COLORADO'), ('MITSUBISHI', 'MONTERO'), ('DODGE', 'GRAND CARAVAN')],
    ['Make', 'Model']
)
Let's do a row count:
df.groupby('Model').count().show()
+-------------------+-----+
| Model|count|
+-------------------+-----+
| AVALON| 2|
| CENTURY| 1|
| TUNDRA| 2|
| WRANGLER| 3|
| GRAND AM| 3|
|EXPLORER SPORT TRAC| 1|
| C-CLASS| 2|
| MONTERO SPORT| 3|
| CAMRY| 3|
| GRAND CARAVAN| 1|
| SILVERADO 1500| 2|
| PRIUS| 3|
| MONTERO| 1|
| COLORADO| 1|
| RANGER| 2|
| SLK-CLASS| 3|
| SL-CLASS| 2|
| IS 300| 1|
+-------------------+-----+
If I understand your question correctly, you can assign a row number to each row with a partition by Model:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc
win_1 = Window.partitionBy('Model').orderBy(desc('Make'))
df = df.withColumn('row_num', row_number().over(win_1))
And then filter the rows down to those where row_num <= 2:
df = df.filter(df.row_num <= 2).select('Make', 'Model')
There should be a total of 2+1+2+2+2+1+2+2+2+1+2+2+1+1+2+2+2+1 = 30 rows.
Final results:
+-------------+-------------------+
| Make| Model|
+-------------+-------------------+
| TOYOTA| AVALON|
| TOYOTA| AVALON|
| BUICK| CENTURY|
| TOYOTA| TUNDRA|
| TOYOTA| TUNDRA|
| JEEP| WRANGLER|
| JEEP| WRANGLER|
| PONTIAC| GRAND AM|
| PONTIAC| GRAND AM|
| FORD|EXPLORER SPORT TRAC|
|MERCEDES-BENZ| C-CLASS|
|MERCEDES-BENZ| C-CLASS|
| MITSUBISHI| MONTERO SPORT|
| MITSUBISHI| MONTERO SPORT|
| TOYOTA| CAMRY|
| TOYOTA| CAMRY|
| DODGE| GRAND CARAVAN|
| CHEVROLET| SILVERADO 1500|
| CHEVROLET| SILVERADO 1500|
| TOYOTA| PRIUS|
| TOYOTA| PRIUS|
| MITSUBISHI| MONTERO|
| CHEVROLET| COLORADO|
| FORD| RANGER|
| FORD| RANGER|
|MERCEDES-BENZ| SLK-CLASS|
|MERCEDES-BENZ| SLK-CLASS|
|MERCEDES-BENZ| SL-CLASS|
|MERCEDES-BENZ| SL-CLASS|
| LEXUS| IS 300|
+-------------+-------------------+
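The desc('Make') ordering above is deterministic but arbitrary. If a random subset per model is preferred, the ordering can be swapped for F.rand() as in Answer 1, and the result written out to finish the original task; a sketch under those assumptions (the output path is a hypothetical example):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank the rows of each Model partition in a random order, keep at
# most 2 per model (use 10000 on the real data), and export.
win_rand = Window.partitionBy('Model').orderBy(F.rand())
df = df.withColumn('row_num', F.row_number().over(win_rand)) \
       .filter(F.col('row_num') <= 2) \
       .select('Make', 'Model')
df.write.parquet('reduced.parquet')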
Source: https://stackoverflow.com/questions/59673909/how-do-i-reduce-a-spark-dataframe-to-a-maximum-amount-of-rows-for-each-value-in