问题
I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.
Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.
I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:
` Name Salary
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200`
What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.
In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.
In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..
Such as A, B, C then, C, B, A..
My traditional SQL which doesn't work well enough is something like this:
` SELECT distinct
ONE.name AS "1",
TWO.name AS "2",
THREE.name AS "3",
FOUR.name AS "4",
FIVE.name AS "5",
SIX.name AS "6",
sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
FROM
nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
where ONE.name != TWO.name
and ONE.name != THREE.name
and one.name != four.name
and one.name != five.name
and TWO.name != THREE.name
and TWO.name != four.name
and two.name != five.name
and TWO.name != six.name
and THREE.name != four.name
and THREE.name != five.name
and three.name != six.name
and five.name != six.name
and four.name != six.name
and four.name != five.name
and one.name != six.name
group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`
Is there a way to do this in Pandas/Python?
Any documentation that can be pointed to would be great!
回答1:
Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.
As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.
#!/usr/bin/env python3
''' Limited combinations
Generate combinations of players whose combined salaries fall within given limits
See http://stackoverflow.com/q/38636460/4014959
Written by PM 2Ring 2016.07.28
'''
data = '''\
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200
'''
data = [s.split() for s in data.splitlines()]
all_players = [(' '.join(u[1:-1]), int(u[-1])) for u in data]
all_players.sort(key=lambda t: t[1])
for i, row in enumerate(all_players):
print(i, row)
print('- '*40)
def choose_teams(free, num, team=(), value=0):
num -= 1
for i, p in enumerate(free):
salary = all_players[p][1]
newvalue = value + salary
if newvalue <= hi:
newteam = team + (p,)
if num == 0:
if newvalue >= lo:
yield newteam, newvalue
else:
yield from choose_teams(free[i+1:], num, newteam, newvalue)
else:
break
#Salary limits
lo, hi = 55000, 60500
#Indices of players that can be chosen for a team
free = tuple(range(len(all_players)))
for i, (t, s) in enumerate(choose_teams(free, 6), 1):
team = [all_players[p] for p in t]
names, sals = zip(*team)
assert sum(sals) == s
print(i, t, names, s)
output
0 ('Rickie Fowler', 9200)
1 ('Sergio Garcia', 9400)
2 ('Adam Scott', 9600)
3 ('Justin Rose', 9800)
4 ('Phil Mickelson', 10200)
5 ('Henrik Stenson', 10500)
6 ('Jordan Spieth', 11100)
7 ('Rory McIlroy', 11400)
8 ('Dustin Johnson', 11600)
9 ('Jason Day', 11700)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 (0, 1, 2, 3, 4, 5) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson') 58700
2 (0, 1, 2, 3, 4, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jordan Spieth') 59300
3 (0, 1, 2, 3, 4, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Rory McIlroy') 59600
4 (0, 1, 2, 3, 4, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Dustin Johnson') 59800
5 (0, 1, 2, 3, 4, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jason Day') 59900
6 (0, 1, 2, 3, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jordan Spieth') 59600
7 (0, 1, 2, 3, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Rory McIlroy') 59900
8 (0, 1, 2, 3, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Dustin Johnson') 60100
9 (0, 1, 2, 3, 5, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jason Day') 60200
10 (0, 1, 2, 3, 6, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Jordan Spieth', 'Rory McIlroy') 60500
11 (0, 1, 2, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60000
12 (0, 1, 2, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60300
13 (0, 1, 2, 4, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Dustin Johnson') 60500
14 (0, 1, 3, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60200
15 (0, 1, 3, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60500
16 (0, 2, 3, 4, 5, 6) ('Rickie Fowler', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60400
If you're using an older version of Python that doesn't support the yield from
syntax you can replace
yield from choose_teams(free[i+1:], num, newteam, newvalue)
with
for t, v in choose_teams(free[i+1:], num, newteam, newvalue):
yield t, v
回答2:
I ran this for combinations of 6 and found no teams that satisfied. I used 5 instead.
This should get you there:
from itertools import combinations
import pandas as pd
s = df.set_index('Name').squeeze()
combos = pd.DataFrame([c for c in combinations(s.index, 5)])
combo_salary = combos.apply(lambda x: s.ix[x].sum(), axis=1)
combos[(combo_salary >= 45000) & (combo_salary <= 50000)]
回答3:
As mentioned in the comments, this is a constraint satisfaction problem. It has a combinatorial part but since you defined no objectives to minimize or maximize, it is not an optimization problem (yet). You can approach this problems in many ways: you can try brute-force like piRSquared or use a heuristic algorithm like PM 2Ring. I will present a solution with 0-1 linear programming, and use PuLP library to model and solve the problem.
from pulp import *
import pandas as pd
df = df.set_index('Name')
feasible_solutions = []
In order to model the problem, first you need to define decision variables. Here, the decision variable will be an indicator variable for each player: it will be 1 if that player is selected, 0 otherwise. Here's how you do it in PuLP:
players = LpVariable.dicts('player', df.index.tolist(), 0, 1, LpInteger)
Next, you create a problem:
prob = pulp.LpProblem('Team Selection', pulp.LpMinimize)
As I mentioned earlier, your question does not state any objectives. You only want to create all possible teams. Therefore, we will define an arbitrary objective function (I will turn back to this arbitrary function again).
prob += 0
You mainly have two constraints:
1) The team will have 5 players:
prob += lpSum([players[player] for player in players]) == 5
Remember that players dictionary stores our decision variables. players[player]
is either 1 (if that player is in the team) or 0 (otherwise). Therefore, if you sum all of them, the result should be equal to 5.
2) Total salary should be between 45k and 50k.
prob += lpSum([players[player] * df.at[player, 'Salary']
for player in players]) <= 50000
prob += lpSum([players[player] * df.at[player, 'Salary']
for player in players]) >= 45000
This is similar to the first constraint. Here, we are not counting but instead summing the salaries (when the player is in the team, the value will be 1 so it will be multiplied by the corresponding salary. Otherwise, the value will be zero and the multiplication will also be zero).
The main modeling is done here. If you call prob.solve()
, it will find a solution that satisfies these constraints. Normally, in optimization problems, we provide an objective function and try to maximize or minimize that. For example, assume that you have scores for the skills of players. Your budget is limited, you cannot go ahead and select top 5 players. So, in the part where we stated prob += 0
, you can define an objective function to maximize total skill score. But that wasn't what you were looking for so let's continue.
Once you find a solution, you can add another constraint to the problem stating that the next solution should be different than this, you can generate all solutions.
while prob.solve() == 1:
current_solution = [player for player in players if value(players[player])]
feasible_solutions.append(current_solution)
prob += lpSum([players[player] for player in current_solution]) <= 4
prob.solve()
is the method that solves the problem. Based on the result, it returns an integer. If it finds an optimal solution, the result is 1. For infeasible or unbounded solutions there are different codes. So as long as we can find new solutions, we continue the loop.
In the loop we first append the current solution to our feasible_solutions
list. Then, we add another constraint: for these 5 players, the sum of the variables cannot exceed 4 (The largest value 5 and if it is 5, we know that this is the same solution).
If you run this, you will have the same result piRSquared got.
So, what is the advantage of this?
The main reason we use integer/binary linear programming is that number of combinations grows really quickly. This is called combinatorial explosion. Take a look at the number of possible teams (without any constraint):
from scipy.misc import comb
comb(10, 5)
Out: 252.0
comb(20, 5)
Out: 15504.0
comb(50, 5)
Out: 2118760.0
comb(100, 5)
Out: 75287520.0
It becomes almost impossible to evaluate all combinations.
Of course when you want to list all combinations that satisfies those constraints you are still running that risk. If the number of combinations satisfying the constraints is large, it will take a lot of time to compute. You cannot avoid that. However, if that subset is small or it is still large but you are evaluating a function on that set, it will be much better.
For example, consider the following DataFrame:
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Name': ['player' + str(i).zfill(3) for i in range(100)],
'Salary': np.random.randint(0, 9600, 100)})
268 of 75287520 solutions satisfy the salary constraint. It took 44 seconds in my computer to list them. It would take hours to find them using brute-force (update: it takes 8 hours and 21 minutes).
PuLP uses an open source solver, Cbc, by default. There are other open source/commercial alternative solvers that you can use with PuLP. Commercial ones are generally faster as expected (they are very expensive though).
Another alternative is constraint programming as I mentioned in the comments. For these kind of problems, you can find many smart ways to reduce the search space with constraint programming. I am comfortable with integer programming so I showed a model based on that but I should note that constraint programming might be way better for this.
来源:https://stackoverflow.com/questions/38636460/python-pandas-self-join-for-merge-cartesian-product-to-produce-all-combinations