Question
I have a csv file with more than 100 columns and 3500 rows which looks like this (just an example):
import pandas as pd
data = pd.DataFrame(data={
    'Profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit2': [False, False, False, True, True, False],
    'Crit3': [True, True, False, True, True, True],
    'Crit4': [False, True, True, False, False, False],
    'Crit5': [True, False, False, True, True, True]
})
I'd like to define 3 results:
1 - totalProfit: the sum of the "Profit" column
2 - posValues: how many positive values in the resulting "Profit" column
3 - negValues: how many negative values in the resulting "Profit" column
totalProfit = data['Profit'].sum() # The sum is 165
In this example, posValues will be 3 and negValues will be 3. But I don't know how to count them with a formula.
I'd like to find the best combination of column filters (True/False) that increases posValues and decreases negValues while maximising totalProfit.
My guess is that the best combination is Crit1 and Crit5 set to True:
print(data[(data.Crit5 == True) & (data.Crit1 == True)])
totalResult = data['Profit'][(data.Crit5 == True) & (data.Crit1 == True)].sum()
print(totalResult)
So with this combination we'll have totalResult = 129, posValues = 2 and negValues = 1, and the combination is: set Crit1 and Crit5 to True.
Please keep in mind that it's not mandatory to filter on all the columns; some can be left unfiltered (as in the example).
How can I write code that increases posValues and decreases negValues while maximising totalProfit, and displays which combination of True/False columns is the best?
Answer 1:
TL;DR
We present here an efficient algorithm that, while still potentially having a worst case that is exponential time (although a formal analysis might prove that even the worst case is much better than that), in practice has O(n^2) median time and is practical for hundreds of columns. As a test, the solution for a random df with n=120 selector columns and nrows=3500 was found in 15min31s on a single core.
Warning: fairly long answer below.
A better algorithmic solution
While my other answer focused on the trivial question of how to use pandas for this problem, it only provided a brute-force algorithm (simple and appropriate for a small number of columns). In this answer, I address the computational complexity and how to cut it down to handle the kind of scale the OP mentioned (more than 100 columns, thousands of rows).
My initial intuition was that the problem may be NP-hard. The brute-force approach has exponential complexity (it explicitly considers the O(2^n) combinations of n columns). That clearly is infeasible for even medium-size n. Other classic problems fall into that category, e.g. the Travelling Salesman and the deceptively simple Subset sum.
There is no obvious way to decide that a partial solution in the search path of all combinations is going to lead to the optimum. In fact, observing the solutions found by brute force, ordered by descending score, shows that their arrangement can be quite intricate. Here is a heat map of all column combinations for a randomly generated df with n=9 (and 500 rows). The rows have been sorted by the original df.profit descending. Columns are all the 512 solutions, ordered from best (left) to worst (right). The content of the heat map is the resulting effective filter (or-ing all boolean columns selected by that solution), with True shown as black and False as white:

[heat map: 500 rows x 512 solutions; effective filter shown black (True) / white (False)]
We can, however, avoid exploring the full set of combinations by using the algorithm described in the next section.
Beam search algorithm
A Beam search is "a heuristic algorithm that explores a graph by expanding the most promising node in a limited set."
We use this approach to maintain a set of candidates (each a promising column subset) that are worthy of exploration. While often used as a heuristic to find reasonably good solutions (but not necessarily the best), in our case we only discard paths that can provably not lead to the optimum. As such, as long as we let the algorithm complete, we are guaranteed to deliver the best solution. We also observed that interrupting the search partway still yields excellent solutions, and often the optimum itself.
Definition of a candidate
A candidate is a particular column subset (a selection of columns to use in filtering df) and has a score corresponding to the profit it yields. In addition to being a solution on its own, it can also be used as a prefix for "descendant" candidates, by merging it with another candidate. For example, a candidate {a,c} can be merged with another {d,z} to produce a new candidate {a,c,d,z}.
For a candidate x, we store the following quantities:
- x.k: the column set
- x.p: indicator of rows where profit > 0
- x.n: indicator of rows where profit < 0 (we disregard profit == 0)
- x.tot: sum(profit | (x.p + x.n)) (total profit for that candidate)
- x.totp: sum(profit | x.p) (sum of positive-only profit for that candidate)
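As a minimal numeric sketch of these quantities (an illustration only, using the question's data and the candidate {Crit1, Crit5}; that colset selects the rows with profits 90, 40 and -1):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit5': [True, False, False, True, True, True],
})
truth = (data['Crit1'] & data['Crit5']).values  # rows selected by the colset
xp = truth & (data['profit'] > 0).values        # x.p: positive-profit rows selected
xn = truth & (data['profit'] < 0).values        # x.n: negative-profit rows selected
xtotp = data.loc[xp, 'profit'].sum()            # x.totp == 130, the upper bound
xtot = xtotp + data.loc[xn, 'profit'].sum()     # x.tot == 129, the candidate's score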
Cutting the exploration path
Some observations lead us to aggressively cut down the exploration path:
- The maximum profit that a candidate x and all its descendants can ever hope to achieve is x.totp: when adding columns to the column set, one could perhaps reduce the set of negative rows, but never increase the set of positive ones. So x.totp is the upper bound of any profit down the exploration path of this candidate.
- Each candidate should be uniquely visited: after visiting {a,b} by merging {a} with {b}, we could inadvertently revisit the same candidate by merging {b} with {a}. Thus, when merging x with y, we require that all columns in x.k are strictly smaller than every column in y.k (max(x.k) < min(y.k)).
- When considering a merge of two candidates, there is no opportunity for improvement if either set of negative rows is a subset of the other: reject if x.n < y.n or y.n < x.n (< here being the 'is strict subset' set operator).
This leads to the following implementation:
import numpy as np
import pandas as pd
from collections import namedtuple

class candidate(namedtuple('Cand', 'tot totp k p n')):
    """
    A candidate solution or partial solution.
    k:    (colset) set of columns (e.g. {'a', 'b', 'c'})
    tot:  sum(profit | (p | n)) (sum of profit for all rows selected by this colset)
    totp: sum(profit | p) (sum of profit for positive-only rows selected by this colset)
    p:    bool np.array indicating where profit > 0 for this colset
    n:    bool np.array indicating where profit < 0 for this colset
    """
    def name(self):
        cols = ''.join(sorted(self.k))
        return cols

    def __str__(self):
        cols = ''.join(sorted(self.k))
        return (f'{cols} ({self.tot:.2f}, max {self.totp:.2f}, '
                f'|p| = {self.p.sum()}, |n| = {self.n.sum()})')

    def __repr__(self):
        return str(self)

def make_candidate(df, k):
    # rows selected by column k; the empty colset (k=None) selects all rows
    truth = df[k].values if k else np.ones(df.shape[0], dtype=bool)
    xk = frozenset({k} if k else {})  # frozenset can be used as dict key, if needed
    xp = (truth & (df['profit'] > 0)).values
    xn = (truth & (df['profit'] < 0)).values
    xtotp = df.loc[xp, 'profit'].sum()
    xtot = xtotp + df.loc[xn, 'profit'].sum()
    return candidate(xtot, xtotp, xk, xp, xn)

def merge(beam, x, y):
    """merge two candidates x, y if deemed viable, else return None"""
    if max(x.k) >= min(y.k):
        return None  # avoid visiting the same colset several times
    if x.totp < y.tot or y.totp < x.tot:
        return None  # z could never best x or y
    zn = x.n * y.n  # intersection of negative-row indicators
    zp = x.p * y.p  # intersection of positive-row indicators
    ztotp = beam.df.loc[zp, 'profit'].sum()
    if ztotp < beam.best.tot or ztotp <= x.tot or ztotp <= y.tot:
        return None  # z could never best the beam's best so far, or x or y
    ztot = ztotp + beam.df.loc[zn, 'profit'].sum()
    z = candidate(ztot, ztotp, x.k.union(y.k), zp, zn)
    return z
class Beam:
    def __init__(self, df, best, singles, sol):
        self.df = df
        self.best = best
        self.singles = singles
        self.sol = sol
        self.loops = 0

    @classmethod
    def from_df(cls, df):
        cols = [k for k in df.columns if k != 'profit']
        # make solutions, first: empty set, then single-column ones
        oa = make_candidate(df, None)
        singles = [make_candidate(df, k) for k in cols]
        best = max([oa] + singles, key=lambda x: x.tot)
        singles = [x for x in singles if x.totp > best.tot]
        return cls(df, best, singles, singles.copy())

    def add_candidate(self, z):
        if z is None:
            return False
        self.sol.append(z)
        if z.tot > self.best.tot:
            self.best = z
            self.prune()
        return True

    def prune(self):
        """remove solutions that cannot become better than the current best"""
        self.sol = [x for x in self.sol if x.totp >= self.best.tot]

    def __str__(self):
        return f'Beam: best: {self.best}, |sol| = {len(self.sol)}, ' \
               f'|singles| = {len(self.singles)}, loops = {self.loops}'

    def __repr__(self):
        return str(self)

    def optimize(self, max_iters=None, report_freq=None):
        i = 0
        while self.sol and (max_iters is None or i < max_iters):
            if report_freq is not None and i % report_freq == 0:
                print(f'loop {i:5d}, beam = {self}')
            x = self.sol.pop(0)
            for y in self.singles:
                self.add_candidate(merge(self, x, y))
            i += 1
            self.loops += 1
        if report_freq:
            print(f'done {i:5d}, beam = {self}')
For experimentation, we can generate random frames:
from itertools import combinations, count, islice
from string import ascii_lowercase

def gen_colnames():
    # yields 'a', 'b', ..., 'z', 'ab', 'ac', ... as column names
    for n in count(1):
        for t in combinations(ascii_lowercase, n):
            yield ''.join(t)

def colnames(n):
    if n > len(ascii_lowercase):
        return [f'c{k}' for k in range(n)]
    else:
        return list(islice(gen_colnames(), n))

def generate_df(ncols, nrows=500):
    df = pd.DataFrame({'profit': np.random.normal(scale=100, size=nrows)})
    cols = pd.DataFrame(np.random.choice([False, True], size=(nrows, ncols)),
                        columns=colnames(ncols))
    df = pd.concat([df, cols], axis=1)
    return df
And then we can test it:
df = generate_df(15)
%%time
res = brute_force_all(df)
# CPU times: user 40.9 s, sys: 68 ms, total: 41 s
%%time
beam = Beam.from_df(df)
beam.optimize()
# CPU times: user 165 ms, sys: 3 µs, total: 165 ms
# verify that we have the same solution as the brute-force
assert beam.best.k == res.colset.iloc[0]
# summary of the beam result:
beam
# Beam: best: bf (2567.64, max 5930.26, |p| = 68, |n| = 51), |sol| = 0, |singles| = 15, loops = 228
Performance
We do not provide a formal analysis of the time or space complexity of our solution (we'll leave that as an exercise for the reader...), but here are some empirical observations obtained on a series of runs on random frames of various sizes:
[log-log plot: individual run times (points) and median times (lines) vs n, brute force and beam search]

This log-log plot shows individual run times (points) and median times (lines) for both the brute force and the beam search. As expected, the brute-force time is exponential. The beam search is considerably faster, and the median times appear to be polynomial in n (a straight line on the log-log plot). A reasonably good fit for the median time is obtained with a degree-2 polynomial (so, O(n^2)).
Note that, up to size n=64, in thousands of runs we never hit the maximum budget we had set (max_iters = 100_000). The solutions found were therefore all optimal.
Note also that there is a long-tail dispersion of the actual time, depending on the complexity of the search surface. Here is the histogram of individual times for n=64:

[histogram: distribution of individual run times for n=64]
But this is mitigated by the ability to stop the search early. In fact, our observations are that, even in the longest cases, very good solutions (or often the optimum itself) appear quite early in the search.
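As an illustration, here is a minimal sketch of such an early stop, using the max_iters and report_freq parameters of Beam.optimize() defined above (the budget of 10_000 iterations is an arbitrary choice for this sketch):

beam = Beam.from_df(df)
beam.optimize(max_iters=10_000, report_freq=1_000)  # stop after at most 10_000 loops
print(beam.best)  # best candidate found within the budget, often already the optimum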
Answer 2:
This is only a partial answer and doesn't address your optimization question, so please don't accept it at this time.
To calculate how many positive and negative values are in the 'Profit' column you can use the following code:
posValues = (data['Profit'] > 0).sum()
zeroValues = (data['Profit'] == 0).sum()
negValues = (data['Profit'] < 0).sum()
Or you can define a function to compute this information:
def sign_counts(values):
    positive = (values > 0).sum()
    zero = (values == 0).sum()
    negative = (values < 0).sum()
    return negative, zero, positive

negative, zero, positive = sign_counts(data['Profit'])

def nonpositive_positive_counts(values):
    positive = (values > 0).sum()
    nonpositive = values.size - positive
    return nonpositive, positive

nonpositive, positive = nonpositive_positive_counts(data['Profit'])
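As a quick check on the question's example data, where we expect 3 positives and 3 negatives, a short sketch:

posValues = (data['Profit'] > 0).sum()  # 3 (90, 111 and 40)
negValues = (data['Profit'] < 0).sum()  # 3 (-70, -5 and -1)
print(sign_counts(data['Profit']))      # (3, 0, 3) -> negative, zero, positive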
Answer 3:
Brute force (original) answer
This answer focuses on the mundane question of how to use pandas (and numpy) to find solutions to the problem. It is naive from an algorithmic-complexity perspective, and indeed is O(2^n), as it evaluates all possible combinations of columns. As such, it runs in exponential time. See my other answer for a Beam search solution that has median time O(n^2).
In this naive approach, we will:
- express all combinations of columns (not including Profit);
- define a filtered function that filters the data frame based on a given column set;
- define a metrics function that returns a tuple (sum(Profit), posCount, -negCount);
- compute the metrics for all combinations, and assemble the results into a df;
- sort the df by the metrics tuples.
import numpy as np
import pandas as pd
from itertools import combinations

def metrics(s):
    # returns three quantities on a Series s: sum, poscount, -negcount
    return s.sum(), (s > 0).sum(), -(s < 0).sum()

def filtered(df, combo):
    # given a combo: set of columns, filter the df to keep
    # the rows where all the columns are True
    mask = np.all(df[list(combo)], axis=1)  # list(): pandas does not index by set
    return df.loc[mask]

def brute_force_all(df):
    """
    Return all brute-force solutions. O(2^n).
    """
    # get all combinations of columns (except for 'profit')
    crit_cols = [k for k in df.columns if k != 'profit']
    combos = [set(combo) for n in range(0, len(crit_cols) + 1)
              for combo in combinations(crit_cols, n)]
    # assemble a df made of metrics and colset
    res = pd.DataFrame([
        metrics(filtered(df, combo)['profit']) + (combo,)
        for combo in combos
    ], columns='total poscount negcount colset'.split())
    # finally, sort to expose the "best" result first
    res = res.sort_values(['total', 'poscount', 'negcount'], ascending=False)
    res = res.reset_index(drop=True)
    return res
Example on your data (note: the code expects a lowercase profit column, so rename first, e.g. data = data.rename(columns={'Profit': 'profit'})):
total poscount negcount colset
0 165 3 -3 {}
9 129 2 -1 {Crit5, Crit1}
20 129 2 -1 {Crit5, Crit1, Crit3}
5 124 2 -2 {Crit5}
14 124 2 -2 {Crit5, Crit3}
...
29 0 0 0 {Crit5, Crit2, Crit4, Crit3}
30 0 0 0 {Crit2, Crit4, Crit5, Crit1, Crit3}
7 -70 0 -1 {Crit1, Crit4}
12 -70 0 -1 {Crit3, Crit4}
18 -70 0 -1 {Crit3, Crit1, Crit4}
Details:
In order to understand the code above, it is a good idea to inspect some of the quantities we set out to compute. For example:
>>> combos
[set(),
{'Crit1'},
...
{'Crit5'},
{'Crit1', 'Crit2'},
...
{'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3'},
...
{'Crit3', 'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3', 'Crit4'},
...
{'Crit2', 'Crit3', 'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3', 'Crit4', 'Crit5'}]
# metrics on the unfiltered (whole) data:
>>> metrics(data['Profit'])
(165, 3, -3)
# data filtered where Crit2 and Crit3 are True:
>>> filtered(data, {'Crit2', 'Crit3'})
Profit Crit1 Crit2 Crit3 Crit4 Crit5
3 40 True True True False True
4 -5 False True True False True
# metrics on the above:
>>> metrics(filtered(data, {'Crit2', 'Crit3'})['Profit'])
(35, 1, -1)
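Since res is sorted best-first, reading off the winning combination is just a matter of taking the top row. A small sketch on the question's data (where, per the output above, the empty colset wins):

res = brute_force_all(data.rename(columns={'Profit': 'profit'}))
best = res.iloc[0]
print(best['colset'], best['total'])  # set() 165 -- no filtering is best here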
Answer 4:
Based on all of @PierreD's code:
import pandas as pd
import numpy as np
from collections import namedtuple
from itertools import combinations
df = pd.DataFrame(data={
    'profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit2': [False, False, False, True, True, False],
    'Crit3': [True, True, False, True, True, True],
    'Crit4': [False, True, True, False, False, False],
    'Crit5': [True, False, False, True, True, True]
})
def metrics(s):
    # returns three quantities on a Series s: sum, poscount, -negcount
    return s.sum(), (s > 0).sum(), -(s < 0).sum()

def filtered(df, combo):
    # given a combo: set of columns, filter the df to keep
    # the rows where all the columns are True
    mask = np.all(df[list(combo)], axis=1)  # list(): pandas does not index by set
    return df.loc[mask]

def brute_force_all(df):
    """
    Return all brute-force solutions. O(2^n).
    """
    # get all combinations of columns (except for 'profit')
    crit_cols = [k for k in df.columns if k != 'profit']
    combos = [set(combo) for n in range(0, len(crit_cols) + 1)
              for combo in combinations(crit_cols, n)]
    # assemble a df made of metrics and colset
    res = pd.DataFrame([
        metrics(filtered(df, combo)['profit']) + (combo,)
        for combo in combos
    ], columns='total poscount negcount colset'.split())
    # finally, sort to expose the "best" result first
    res = res.sort_values(['total', 'poscount', 'negcount'], ascending=False)
    res = res.reset_index(drop=True)
    return res
class candidate(namedtuple('Cand', 'tot totp k p n')):
    """
    A candidate solution or partial solution.
    k:    (colset) set of columns (e.g. {'a', 'b', 'c'})
    tot:  sum(profit | (p | n)) (sum of profit for all rows selected by this colset)
    totp: sum(profit | p) (sum of profit for positive-only rows selected by this colset)
    p:    bool np.array indicating where profit > 0 for this colset
    n:    bool np.array indicating where profit < 0 for this colset
    """
    def name(self):
        cols = ''.join(sorted(self.k))
        return cols

    def __str__(self):
        cols = ''.join(sorted(self.k))
        return (f'{cols} ({self.tot:.2f}, max {self.totp:.2f}, '
                f'|p| = {self.p.sum()}, |n| = {self.n.sum()})')

    def __repr__(self):
        return str(self)

def make_candidate(df, k):
    # rows selected by column k; the empty colset (k=None) selects all rows
    truth = df[k].values if k else np.ones(df.shape[0], dtype=bool)
    xk = frozenset({k} if k else {})  # frozenset can be used as dict key, if needed
    xp = (truth & (df['profit'] > 0)).values
    xn = (truth & (df['profit'] < 0)).values
    xtotp = df.loc[xp, 'profit'].sum()
    xtot = xtotp + df.loc[xn, 'profit'].sum()
    return candidate(xtot, xtotp, xk, xp, xn)

def merge(beam, x, y):
    """merge two candidates x, y if deemed viable, else return None"""
    if max(x.k) >= min(y.k):
        return None  # avoid visiting the same colset several times
    if (x.totp < y.tot or y.totp < x.tot):
        return None  # z could never best x or y
    zn = x.n * y.n  # intersection of negative-row indicators
    zp = x.p * y.p  # intersection of positive-row indicators
    ztotp = beam.df.loc[zp, 'profit'].sum()
    if ztotp < beam.best.tot or ztotp <= x.tot or ztotp <= y.tot:
        return None  # z could never best the beam's best so far, or x or y
    ztot = ztotp + beam.df.loc[zn, 'profit'].sum()
    z = candidate(ztot, ztotp, x.k.union(y.k), zp, zn)
    return z
class Beam:
    def __init__(self, df, best, singles, sol):
        self.df = df
        self.best = best
        self.singles = singles
        self.sol = sol
        self.loops = 0

    @classmethod
    def from_df(cls, df):
        cols = [k for k in df.columns if k != 'profit']
        # make solutions, first: empty set, then single-column ones
        oa = make_candidate(df, None)
        singles = [make_candidate(df, k) for k in cols]
        best = max([oa] + singles, key=lambda x: x.tot)
        singles = [x for x in singles if x.totp > best.tot]
        return cls(df, best, singles, singles.copy())

    def add_candidate(self, z):
        if z is None:
            return False
        self.sol.append(z)
        if z.tot > self.best.tot:
            self.best = z
            self.prune()
        return True

    def prune(self):
        """remove solutions that cannot become better than the current best"""
        self.sol = [x for x in self.sol if x.totp >= self.best.tot]

    def __str__(self):
        return f'Beam: best: {self.best}, |sol| = {len(self.sol)}, ' \
               f'|singles| = {len(self.singles)}, loops = {self.loops}'

    def __repr__(self):
        return str(self)

    def optimize(self, max_iters=None, report_freq=None):
        i = 0
        while self.sol and (max_iters is None or i < max_iters):
            if report_freq is not None and i % report_freq == 0:
                print(f'loop {i:5d}, beam = {self}')
            x = self.sol.pop(0)
            for y in self.singles:
                self.add_candidate(merge(self, x, y))
            i += 1
            self.loops += 1
        if report_freq:
            print(f'done {i:5d}, beam = {self}')
%%time
res = brute_force_all(df)
%%time
beam = Beam.from_df(df)
beam.optimize()
# verify that we have the same solution as the brute-force
assert beam.best.k == res.colset.iloc[0]
# summary of the beam result:
beam
With this code, I get: Beam: best: (165.00, max 241.00, |p| = 3, |n| = 3), |sol| = 0, |singles| = 0, loops = 0. (On this small example the empty colset, i.e. no filtering at all with tot = 165, is already the best: no single column's totp exceeds it, so all singles are pruned and no merge loops run.)
Source: https://stackoverflow.com/questions/65345340/maximizing-optimizing-3-results-at-the-same-time