Question
On a pandas DataFrame, I know I can group by one or more columns and then filter values that occur more or less often than a given number of times. But I want to do this on every column of the DataFrame. I want to remove values that are too infrequent (say, occurring less than 5% of the time) or too frequent. As an example, consider a DataFrame with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np

# 100 random lowercase letters per column
# (string.lowercase is Python 2 only; string.ascii_lowercase works on Python 3)
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True))
        for c in ('city of origin', 'city of destination',
                  'distance, type of transport (air/car/foot)',
                  'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>>> df.head()
  city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0                   f              p                                           a                           n
1                   k              b                                           a                           f
2                   q              s                                           n                           j
3                   h              c                                           g                           u
4                   w              d                                           m                           h
If this is a big DataFrame, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to run value_counts on every column, transform, and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
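For concreteness, here is a minimal sketch of that value_counts idea, assuming a 5% lower and a 95% upper frequency bound (both thresholds are illustrative):
# map each cell to its column-wise relative frequency, then keep rows
# where every column's value lies inside the assumed [0.05, 0.95] band
freq = df.apply(lambda c: c.map(c.value_counts(normalize=True)))
df_filtered = df[((freq >= 0.05) & (freq <= 0.95)).all(axis=1)]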
Answer 1:
This procedure goes through each column of the DataFrame and eliminates rows where the value in that column occurs less often than a given threshold percentage, shrinking the DataFrame on each iteration.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
- It normalizes the value counts so you can just use a percentile threshold.
- It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03  # keep only values occurring in more than 3% of rows
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
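The question also asks to drop values that are too frequent. Here is a sketch of the same loop with an assumed upper bound added (both bounds are illustrative, not part of the original answer):
lo, hi = 0.03, 0.97  # assumed lower/upper frequency bounds
for col in df:
    counts = df[col].value_counts(normalize=True)
    # keep only categories whose relative frequency sits inside the band
    df = df.loc[df[col].isin(counts[(counts > lo) & (counts < hi)].index), :]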
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (10**6, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
Answer 2:
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        # .as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
- m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression).
- df[np.all(..., axis=1)] retains the rows where the condition held across all columns.
- df.apply(...) applies a function to all columns, and .to_numpy() turns each result into an array (the original .as_matrix() was removed in pandas 1.0).
- c.isin(...) checks, for each column item, whether it is in some set.
- c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
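If that intermediate-copy cost matters, one alternative (a sketch, not part of the original answer) is to accumulate a single boolean mask across columns and filter once:
# build one row mask across all columns, then filter a single time;
# avoids the intermediate DataFrame Option B creates per column
mask = pd.Series(True, index=df.index)
for c in df.columns:
    counts = df[c].value_counts()
    mask &= df[c].isin(counts[counts > m].index)
df = df[mask]
Note this counts values against the original DataFrame, whereas the loop variants recount after each column is filtered, so results can differ slightly near the threshold.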
Answer 3:
I am new to Python and Pandas, and came up with the solution below. Maybe other people have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values; just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'

# Set the frequency to filter out. Currently set to 5%
bin_freq = 5 / 100

DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts) / float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])
print(DF_Filtered)
Answer 4:
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which cap all values below or above (respectively) a certain threshold. (In recent pandas, both were removed in favor of clip(lower=...) and clip(upper=...).)
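A small sketch of the modern equivalent; note that clipping replaces out-of-range numeric values with the bound rather than dropping rows, so it addresses a different problem than frequency filtering:
s = pd.Series([1, 50, 200, 999])
s.clip(lower=10, upper=500)  # caps values: [10, 50, 200, 500]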
Source: https://stackoverflow.com/questions/31303946/pandas-filter-dataframe-for-values-that-are-too-frequent-or-too-rare