Question
I have a csv file with more than 100 columns and 3500 rows which looks like this (just an example):
import pandas as pd
data = pd.DataFrame(data={
    'Profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit2': [False, False, False, True, True, False],
    'Crit3': [True, True, False, True, True, True],
    'Crit4': [False, True, True, False, False, False],
    'Crit5': [True, False, False, True, True, True]
})
I'd like to define 3 results:
1 - totalProfit: the sum of the "Profit" column
2 - posValues: how many positive values in the resulting "Profit" column
3 - negValues: how many negative values in the resulting "Profit" column
totalProfit = data['Profit'].sum() # The sum is 165
In this example, posValues will be 3 and negValues will be 3. But I don't know how to count them with a formula.
I'd like to find the best combination of column filters (True/False) that increases posValues and decreases negValues while maximising totalProfit.
My guess is that the best combination is Crit1 and Crit5 set to True:
print(data[(data.Crit5 == True) & (data.Crit1 == True)])
totalResult = data['Profit'][(data.Crit5 == True) & (data.Crit1 == True)].sum()
print(totalResult)
So with this combination we'll have totalResult = 129, posValues = 2 and negValues = 1, and the combination is: set Crit1 and Crit5 to True.
Please keep in mind that it's not mandatory to filter on all the columns; some can be left unfiltered (as in the example).
How can I write code that increases posValues and decreases negValues while maximising totalProfit, and displays which combination of True/False columns is the best?
Answer 1:
TL;DR
We present here an efficient algorithm that, while still potentially having a worst case that is exponential time (although a formal analysis might prove that even the worst case is much better than that), in practice has O(n^2) median time and is practical for hundreds of columns. As a test, the solution for a random df with n=120 selector columns and nrows=3500 was found in 15min31s on a single core.
Warning: fairly long answer below.
A better algorithmic solution
While my other answer focused on the trivial question of how to use pandas for this problem, it only provided a brute-force algorithm (simple and appropriate for a small number of columns). In this answer, I address the computational complexity and how to cut it down to handle the kind of scale the OP mentioned (more than 100 columns, thousands of rows).
My initial intuition was that the problem may be NP-hard. The brute-force approach has exponential complexity (it explicitly considers the O(2^n) combinations of n columns). That clearly is infeasible for even medium-size n. Other classic problems fall into that category, e.g. the Travelling Salesman and the deceptively simple Subset sum.
There is no obvious way to decide that a partial solution in the search path of all combinations is going to lead to the optimum. In fact, observing the solutions found by brute force, ordered by descending score, shows that their arrangement can be quite intricate. Here is a heat map of all column combinations for a randomly generated df with n=9 (and 500 rows). The rows have been sorted by the original df.profit descending. Columns are all the 512 solutions, ordered from best (left) to worst (right). The content of the heat map is the resulting effective filter (or-ing all boolean columns selected by that solution), with True shown as black and False as white:

[heat map: 500 rows x 512 solutions; effective filter shown black (True) / white (False)]
We can, however, avoid exploring the full set of combinations by using the algorithm described in the next section.
Beam search algorithm
A Beam search is "a heuristic algorithm that explores a graph by expanding the most promising node in a limited set."
We use this approach to maintain a set of candidates (each a promising column subset) that are worthy of exploration. While often used as a heuristic to find reasonably good solutions (but not necessarily the best), in our case we only discard paths that can provably not lead to the optimum. As such, as long as we let the algorithm complete, we are guaranteed to deliver the best solution. We also observed that interrupting the search partway still yields excellent solutions, and often the optimum itself.
Definition of a candidate
A candidate is a particular column subset (a selection of columns to use in filtering df) and has a score corresponding to the profit it yields. In addition to being a solution on its own, it can also be used as a prefix for "descendant" candidates, by merging it with another candidate. For example, a candidate {a,c} can be merged with another {d,z} to produce a new candidate {a,c,d,z}.
For a candidate x, we store the following quantities:
- x.k: the column set
- x.p: indicator of rows where profit > 0
- x.n: indicator of rows where profit < 0 (we disregard profit == 0)
- x.tot: sum(profit | (x.p + x.n)) (total profit for that candidate)
- x.totp: sum(profit | x.p) (sum of positive-only profit for that candidate)
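As a minimal numeric sketch of these quantities (an illustration only, using the question's data and the candidate {Crit1, Crit5}; that colset selects the rows with profits 90, 40 and -1):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit5': [True, False, False, True, True, True],
})
truth = (data['Crit1'] & data['Crit5']).values  # rows selected by the colset
xp = truth & (data['profit'] > 0).values        # x.p: positive-profit rows selected
xn = truth & (data['profit'] < 0).values        # x.n: negative-profit rows selected
xtotp = data.loc[xp, 'profit'].sum()            # x.totp == 130, the upper bound
xtot = xtotp + data.loc[xn, 'profit'].sum()     # x.tot == 129, the candidate's score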
Cutting the exploration path
Some observations lead us to aggressively cut down the exploration path:
- The maximum profit that a candidate x and all its descendants can ever hope to achieve is x.totp: when adding columns to the column set, one could perhaps reduce the set of negative rows, but never increase the set of positive ones. So x.totp is the upper bound of any profit down the exploration path of this candidate.
- Each candidate should be uniquely visited: after visiting {a,b} by merging {a} with {b}, we could inadvertently revisit the same candidate by merging {b} with {a}. Thus, when merging x with y, we require that all columns in x.k are strictly smaller than every column in y.k (max(x.k) < min(y.k)).
- When considering a merge of two candidates, there is no opportunity for improvement if either set of negative rows is a subset of the other: reject if x.n < y.n or y.n < x.n (< here being the 'is strict subset' set operator).
This leads to the following implementation:
import numpy as np
import pandas as pd
from collections import namedtuple

class candidate(namedtuple('Cand', 'tot totp k p n')):
    """
    A candidate solution or partial solution.
    k:    (colset) set of columns (e.g. {'a', 'b', 'c'})
    tot:  sum(profit | (p | n)) (sum of profit for all rows selected by this colset)
    totp: sum(profit | p) (sum of profit for positive-only rows selected by this colset)
    p:    bool np.array indicating where profit > 0 for this colset
    n:    bool np.array indicating where profit < 0 for this colset
    """
    def name(self):
        cols = ''.join(sorted(self.k))
        return cols

    def __str__(self):
        cols = ''.join(sorted(self.k))
        return (f'{cols} ({self.tot:.2f}, max {self.totp:.2f}, '
                f'|p| = {self.p.sum()}, |n| = {self.n.sum()})')

    def __repr__(self):
        return str(self)

def make_candidate(df, k):
    # rows selected by column k; the empty colset (k=None) selects all rows
    truth = df[k].values if k else np.ones(df.shape[0], dtype=bool)
    xk = frozenset({k} if k else {})  # frozenset can be used as dict key, if needed
    xp = (truth & (df['profit'] > 0)).values
    xn = (truth & (df['profit'] < 0)).values
    xtotp = df.loc[xp, 'profit'].sum()
    xtot = xtotp + df.loc[xn, 'profit'].sum()
    return candidate(xtot, xtotp, xk, xp, xn)

def merge(beam, x, y):
    """merge two candidates x, y if deemed viable, else return None"""
    if max(x.k) >= min(y.k):
        return None  # avoid visiting the same colset several times
    if x.totp < y.tot or y.totp < x.tot:
        return None  # z could never best x or y
    zn = x.n * y.n  # intersection of negative-row indicators
    zp = x.p * y.p  # intersection of positive-row indicators
    ztotp = beam.df.loc[zp, 'profit'].sum()
    if ztotp < beam.best.tot or ztotp <= x.tot or ztotp <= y.tot:
        return None  # z could never best the beam's best so far, or x or y
    ztot = ztotp + beam.df.loc[zn, 'profit'].sum()
    z = candidate(ztot, ztotp, x.k.union(y.k), zp, zn)
    return z
class Beam:
    def __init__(self, df, best, singles, sol):
        self.df = df
        self.best = best
        self.singles = singles
        self.sol = sol
        self.loops = 0

    @classmethod
    def from_df(cls, df):
        cols = [k for k in df.columns if k != 'profit']
        # make solutions, first: empty set, then single-column ones
        oa = make_candidate(df, None)
        singles = [make_candidate(df, k) for k in cols]
        best = max([oa] + singles, key=lambda x: x.tot)
        singles = [x for x in singles if x.totp > best.tot]
        return cls(df, best, singles, singles.copy())

    def add_candidate(self, z):
        if z is None:
            return False
        self.sol.append(z)
        if z.tot > self.best.tot:
            self.best = z
            self.prune()
        return True

    def prune(self):
        """remove solutions that cannot become better than the current best"""
        self.sol = [x for x in self.sol if x.totp >= self.best.tot]

    def __str__(self):
        return f'Beam: best: {self.best}, |sol| = {len(self.sol)}, ' \
               f'|singles| = {len(self.singles)}, loops = {self.loops}'

    def __repr__(self):
        return str(self)

    def optimize(self, max_iters=None, report_freq=None):
        i = 0
        while self.sol and (max_iters is None or i < max_iters):
            if report_freq is not None and i % report_freq == 0:
                print(f'loop {i:5d}, beam = {self}')
            x = self.sol.pop(0)
            for y in self.singles:
                self.add_candidate(merge(self, x, y))
            i += 1
            self.loops += 1
        if report_freq:
            print(f'done {i:5d}, beam = {self}')
For experimentation, we can generate random frames:
from itertools import combinations, count, islice
from string import ascii_lowercase

def gen_colnames():
    # yields 'a', 'b', ..., 'z', 'ab', 'ac', ... as column names
    for n in count(1):
        for t in combinations(ascii_lowercase, n):
            yield ''.join(t)

def colnames(n):
    if n > len(ascii_lowercase):
        return [f'c{k}' for k in range(n)]
    else:
        return list(islice(gen_colnames(), n))

def generate_df(ncols, nrows=500):
    df = pd.DataFrame({'profit': np.random.normal(scale=100, size=nrows)})
    cols = pd.DataFrame(np.random.choice([False, True], size=(nrows, ncols)),
                        columns=colnames(ncols))
    df = pd.concat([df, cols], axis=1)
    return df
And then we can test it:
df = generate_df(15)
%%time
res = brute_force_all(df)
# CPU times: user 40.9 s, sys: 68 ms, total: 41 s
%%time
beam = Beam.from_df(df)
beam.optimize()
# CPU times: user 165 ms, sys: 3 µs, total: 165 ms
# verify that we have the same solution as the brute-force
assert beam.best.k == res.colset.iloc[0]
# summary of the beam result:
beam
# Beam: best: bf (2567.64, max 5930.26, |p| = 68, |n| = 51), |sol| = 0, |singles| = 15, loops = 228
Performance
We do not provide a formal analysis of the time or space complexity of our solution (we'll leave that as an exercise for the reader...), but here are some empirical observations obtained on a series of runs on random frames of various sizes:
[log-log plot: individual run times (points) and median times (lines) vs n, brute force and beam search]

This log-log plot shows individual run times (points) and median times (lines) for both the brute force and the beam search. As expected, the brute-force time is exponential. The beam search is considerably faster, and the median times appear to be polynomial in n (a straight line on the log-log plot). A reasonably good fit for the median time is obtained with a degree-2 polynomial (so, O(n^2)).
Note that, up to size n=64, in thousands of runs we never hit the maximum budget we had set (max_iters = 100_000). The solutions found were therefore all optimal.
Note also that there is a long-tail dispersion of the actual time, depending on the complexity of the search surface. Here is the histogram of individual times for n=64:

[histogram: distribution of individual run times for n=64]
But this is mitigated by the ability to stop the search early. In fact, our observations are that, even in the longest cases, very good solutions (or often the optimum itself) appear quite early in the search.
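As an illustration, here is a minimal sketch of such an early stop, using the max_iters and report_freq parameters of Beam.optimize() defined above (the budget of 10_000 iterations is an arbitrary choice for this sketch):

beam = Beam.from_df(df)
beam.optimize(max_iters=10_000, report_freq=1_000)  # stop after at most 10_000 loops
print(beam.best)  # best candidate found within the budget, often already the optimum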
Answer 2:
This is only a partial answer and doesn't address your optimization question, so please don't accept it at this time.
To calculate how many positive and negative values are in the 'Profit' column you can use the following code:
posValues = (data['Profit'] > 0).sum()
zeroValues = (data['Profit'] == 0).sum()
negValues = (data['Profit'] < 0).sum()
Or you can define a function to compute this information:
def sign_counts(values):
    positive = (values > 0).sum()
    zero = (values == 0).sum()
    negative = (values < 0).sum()
    return negative, zero, positive

negative, zero, positive = sign_counts(data['Profit'])

def nonpositive_positive_counts(values):
    positive = (values > 0).sum()
    nonpositive = values.size - positive
    return nonpositive, positive

nonpositive, positive = nonpositive_positive_counts(data['Profit'])
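As a quick check on the question's example data, where we expect 3 positives and 3 negatives, a short sketch:

posValues = (data['Profit'] > 0).sum()  # 3 (90, 111 and 40)
negValues = (data['Profit'] < 0).sum()  # 3 (-70, -5 and -1)
print(sign_counts(data['Profit']))      # (3, 0, 3) -> negative, zero, positive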
Answer 3:
Brute force (original) answer
This answer focuses on the mundane question of how to use pandas (and numpy) to find solutions to the problem. It is naive from an algorithmic-complexity perspective, and indeed is O(2^n), as it evaluates all possible combinations of columns. As such, it runs in exponential time. See my other answer for a Beam search solution that has median time O(n^2).
In this naive approach, we will:
- express all combinations of columns (not including Profit);
- define a filtered function that filters the data frame based on a given column set;
- define a metrics function that returns a tuple (sum(Profit), posCount, -negCount);
- compute the metrics for all combinations, and assemble the results into a df;
- sort the df by the metrics tuples.
import numpy as np
import pandas as pd
from itertools import combinations

def metrics(s):
    # returns three quantities on a Series s: sum, poscount, -negcount
    return s.sum(), (s > 0).sum(), -(s < 0).sum()

def filtered(df, combo):
    # given a combo: set of columns, filter the df to keep
    # the rows where all the columns are True
    mask = np.all(df[list(combo)], axis=1)  # list(): pandas does not index by set
    return df.loc[mask]

def brute_force_all(df):
    """
    Return all brute-force solutions. O(2^n).
    """
    # get all combinations of columns (except for 'profit')
    crit_cols = [k for k in df.columns if k != 'profit']
    combos = [set(combo) for n in range(0, len(crit_cols) + 1)
              for combo in combinations(crit_cols, n)]
    # assemble a df made of metrics and colset
    res = pd.DataFrame([
        metrics(filtered(df, combo)['profit']) + (combo,)
        for combo in combos
    ], columns='total poscount negcount colset'.split())
    # finally, sort to expose the "best" result first
    res = res.sort_values(['total', 'poscount', 'negcount'], ascending=False)
    res = res.reset_index(drop=True)
    return res
Example on your data (note: the code expects a lowercase profit column, so rename first, e.g. data = data.rename(columns={'Profit': 'profit'})):
total poscount negcount colset
0 165 3 -3 {}
9 129 2 -1 {Crit5, Crit1}
20 129 2 -1 {Crit5, Crit1, Crit3}
5 124 2 -2 {Crit5}
14 124 2 -2 {Crit5, Crit3}
...
29 0 0 0 {Crit5, Crit2, Crit4, Crit3}
30 0 0 0 {Crit2, Crit4, Crit5, Crit1, Crit3}
7 -70 0 -1 {Crit1, Crit4}
12 -70 0 -1 {Crit3, Crit4}
18 -70 0 -1 {Crit3, Crit1, Crit4}
Details:
In order to understand the code above, it is a good idea to inspect some of the quantities we set out to compute. For example:
>>> combos
[set(),
{'Crit1'},
...
{'Crit5'},
{'Crit1', 'Crit2'},
...
{'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3'},
...
{'Crit3', 'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3', 'Crit4'},
...
{'Crit2', 'Crit3', 'Crit4', 'Crit5'},
{'Crit1', 'Crit2', 'Crit3', 'Crit4', 'Crit5'}]
# metrics on the unfiltered (whole) data:
>>> metrics(data['Profit'])
(165, 3, -3)
# data filtered where Crit2 and Crit3 are True:
>>> filtered(data, {'Crit2', 'Crit3'})
Profit Crit1 Crit2 Crit3 Crit4 Crit5
3 40 True True True False True
4 -5 False True True False True
# metrics on the above:
>>> metrics(filtered(data, {'Crit2', 'Crit3'})['Profit'])
(35, 1, -1)
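Since res is sorted best-first, reading off the winning combination is just a matter of taking the top row. A small sketch on the question's data (where, per the output above, the empty colset wins):

res = brute_force_all(data.rename(columns={'Profit': 'profit'}))
best = res.iloc[0]
print(best['colset'], best['total'])  # set() 165 -- no filtering is best here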
Answer 4:
Based on all of @PierreD's code:
import pandas as pd
import numpy as np
from collections import namedtuple
from itertools import combinations
df = pd.DataFrame(data={
    'profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit2': [False, False, False, True, True, False],
    'Crit3': [True, True, False, True, True, True],
    'Crit4': [False, True, True, False, False, False],
    'Crit5': [True, False, False, True, True, True]
})
def metrics(s):
    # returns three quantities on a Series s: sum, poscount, -negcount
    return s.sum(), (s > 0).sum(), -(s < 0).sum()

def filtered(df, combo):
    # given a combo: set of columns, filter the df to keep
    # the rows where all the columns are True
    mask = np.all(df[list(combo)], axis=1)  # list(): pandas does not index by set
    return df.loc[mask]

def brute_force_all(df):
    """
    Return all brute-force solutions. O(2^n).
    """
    # get all combinations of columns (except for 'profit')
    crit_cols = [k for k in df.columns if k != 'profit']
    combos = [set(combo) for n in range(0, len(crit_cols) + 1)
              for combo in combinations(crit_cols, n)]
    # assemble a df made of metrics and colset
    res = pd.DataFrame([
        metrics(filtered(df, combo)['profit']) + (combo,)
        for combo in combos
    ], columns='total poscount negcount colset'.split())
    # finally, sort to expose the "best" result first
    res = res.sort_values(['total', 'poscount', 'negcount'], ascending=False)
    res = res.reset_index(drop=True)
    return res
class candidate(namedtuple('Cand', 'tot totp k p n')):
    """
    A candidate solution or partial solution.
    k:    (colset) set of columns (e.g. {'a', 'b', 'c'})
    tot:  sum(profit | (p | n)) (sum of profit for all rows selected by this colset)
    totp: sum(profit | p) (sum of profit for positive-only rows selected by this colset)
    p:    bool np.array indicating where profit > 0 for this colset
    n:    bool np.array indicating where profit < 0 for this colset
    """
    def name(self):
        cols = ''.join(sorted(self.k))
        return cols

    def __str__(self):
        cols = ''.join(sorted(self.k))
        return (f'{cols} ({self.tot:.2f}, max {self.totp:.2f}, '
                f'|p| = {self.p.sum()}, |n| = {self.n.sum()})')

    def __repr__(self):
        return str(self)

def make_candidate(df, k):
    # rows selected by column k; the empty colset (k=None) selects all rows
    truth = df[k].values if k else np.ones(df.shape[0], dtype=bool)
    xk = frozenset({k} if k else {})  # frozenset can be used as dict key, if needed
    xp = (truth & (df['profit'] > 0)).values
    xn = (truth & (df['profit'] < 0)).values
    xtotp = df.loc[xp, 'profit'].sum()
    xtot = xtotp + df.loc[xn, 'profit'].sum()
    return candidate(xtot, xtotp, xk, xp, xn)

def merge(beam, x, y):
    """merge two candidates x, y if deemed viable, else return None"""
    if max(x.k) >= min(y.k):
        return None  # avoid visiting the same colset several times
    if (x.totp < y.tot or y.totp < x.tot):
        return None  # z could never best x or y
    zn = x.n * y.n  # intersection of negative-row indicators
    zp = x.p * y.p  # intersection of positive-row indicators
    ztotp = beam.df.loc[zp, 'profit'].sum()
    if ztotp < beam.best.tot or ztotp <= x.tot or ztotp <= y.tot:
        return None  # z could never best the beam's best so far, or x or y
    ztot = ztotp + beam.df.loc[zn, 'profit'].sum()
    z = candidate(ztot, ztotp, x.k.union(y.k), zp, zn)
    return z
class Beam:
    def __init__(self, df, best, singles, sol):
        self.df = df
        self.best = best
        self.singles = singles
        self.sol = sol
        self.loops = 0

    @classmethod
    def from_df(cls, df):
        cols = [k for k in df.columns if k != 'profit']
        # make solutions, first: empty set, then single-column ones
        oa = make_candidate(df, None)
        singles = [make_candidate(df, k) for k in cols]
        best = max([oa] + singles, key=lambda x: x.tot)
        singles = [x for x in singles if x.totp > best.tot]
        return cls(df, best, singles, singles.copy())

    def add_candidate(self, z):
        if z is None:
            return False
        self.sol.append(z)
        if z.tot > self.best.tot:
            self.best = z
            self.prune()
        return True

    def prune(self):
        """remove solutions that cannot become better than the current best"""
        self.sol = [x for x in self.sol if x.totp >= self.best.tot]

    def __str__(self):
        return f'Beam: best: {self.best}, |sol| = {len(self.sol)}, ' \
               f'|singles| = {len(self.singles)}, loops = {self.loops}'

    def __repr__(self):
        return str(self)

    def optimize(self, max_iters=None, report_freq=None):
        i = 0
        while self.sol and (max_iters is None or i < max_iters):
            if report_freq is not None and i % report_freq == 0:
                print(f'loop {i:5d}, beam = {self}')
            x = self.sol.pop(0)
            for y in self.singles:
                self.add_candidate(merge(self, x, y))
            i += 1
            self.loops += 1
        if report_freq:
            print(f'done {i:5d}, beam = {self}')
%%time
res = brute_force_all(df)
%%time
beam = Beam.from_df(df)
beam.optimize()
# verify that we have the same solution as the brute-force
assert beam.best.k == res.colset.iloc[0]
# summary of the beam result:
beam
With this code, I get: Beam: best: (165.00, max 241.00, |p| = 3, |n| = 3), |sol| = 0, |singles| = 0, loops = 0. (On this small example the empty colset, i.e. no filtering at all with tot = 165, is already the best: no single column's totp exceeds it, so all singles are pruned and no merge loops run.)
Source: https://stackoverflow.com/questions/65345340/maximizing-optimizing-3-results-at-the-same-time