Question
On a pandas DataFrame, I know I can group by one or more columns and then filter values that occur more or less often than a given number of times. But I want to do this on every column of the DataFrame. I want to remove values that are too infrequent (say, occurring less than 5% of the time) or too frequent. As an example, consider a DataFrame with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np

# 100 random lowercase letters per column
# (string.lowercase is Python 2 only; string.ascii_lowercase works on Python 3)
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True))
        for c in ('city of origin', 'city of destination',
                  'distance, type of transport (air/car/foot)',
                  'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>>> df.head()
  city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0                   f              p                                           a                           n
1                   k              b                                           a                           f
2                   q              s                                           n                           j
3                   h              c                                           g                           u
4                   w              d                                           m                           h
If this is a big DataFrame, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to run value_counts on every column, transform, and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
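For concreteness, here is a minimal sketch of that value_counts idea, assuming a 5% lower and a 95% upper frequency bound (both thresholds are illustrative):
# map each cell to its column-wise relative frequency, then keep rows
# where every column's value lies inside the assumed [0.05, 0.95] band
freq = df.apply(lambda c: c.map(c.value_counts(normalize=True)))
df_filtered = df[((freq >= 0.05) & (freq <= 0.95)).all(axis=1)]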
Answer 1:
This procedure goes through each column of the DataFrame and eliminates rows where the value in that column occurs less often than a given threshold percentage, shrinking the DataFrame on each iteration.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
- It normalizes the value counts so you can just use a percentile threshold.
- It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03  # keep only values occurring in more than 3% of rows
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
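The question also asks to drop values that are too frequent. Here is a sketch of the same loop with an assumed upper bound added (both bounds are illustrative, not part of the original answer):
lo, hi = 0.03, 0.97  # assumed lower/upper frequency bounds
for col in df:
    counts = df[col].value_counts(normalize=True)
    # keep only categories whose relative frequency sits inside the band
    df = df.loc[df[col].isin(counts[(counts > lo) & (counts < hi)].index), :]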
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (10**6, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
Answer 2:
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        # .as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
- m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression).
- df[np.all(..., axis=1)] retains the rows where the condition held across all columns.
- df.apply(...) applies a function to all columns, and .to_numpy() turns each result into an array (the original .as_matrix() was removed in pandas 1.0).
- c.isin(...) checks, for each column item, whether it is in some set.
- c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
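If that intermediate-copy cost matters, one alternative (a sketch, not part of the original answer) is to accumulate a single boolean mask across columns and filter once:
# build one row mask across all columns, then filter a single time;
# avoids the intermediate DataFrame Option B creates per column
mask = pd.Series(True, index=df.index)
for c in df.columns:
    counts = df[c].value_counts()
    mask &= df[c].isin(counts[counts > m].index)
df = df[mask]
Note this counts values against the original DataFrame, whereas the loop variants recount after each column is filtered, so results can differ slightly near the threshold.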
Answer 3:
I am new to Python and Pandas, and came up with the solution below. Maybe other people have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values; just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'

# Set the frequency to filter out. Currently set to 5%
bin_freq = 5 / 100

DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts) / float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])
print(DF_Filtered)
Answer 4:
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which cap all values below or above (respectively) a certain threshold. (In recent pandas, both were removed in favor of clip(lower=...) and clip(upper=...).)
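A small sketch of the modern equivalent; note that clipping replaces out-of-range numeric values with the bound rather than dropping rows, so it addresses a different problem than frequency filtering:
s = pd.Series([1, 50, 200, 999])
s.clip(lower=10, upper=500)  # caps values: [10, 50, 200, 500]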
Source: https://stackoverflow.com/questions/31303946/pandas-filter-dataframe-for-values-that-are-too-frequent-or-too-rare