Question
I'm trying to predict results of football matches based on earlier results. I'm running Python 3.6 on Windows and using Featuretools 0.4.1.
Let's say I have the following dataframe representing the history of results.
Original DataFrame
Using the dataframe above, I want to create the following dataframe, which will be fed to a machine learning algorithm as X. Note that the goal averages for the home and away teams need to be calculated per team, regardless of the venue of their past matches. Is there a way to create such a dataframe using Featuretools?
Resulting Dataframe
Excel file used to simulate the transformation can be found here.
Answer 1:
This is a tricky feature, but a great use case for a custom primitive in Featuretools.
The first step is to load the CSV of matches into a Featuretools EntitySet:
import featuretools as ft
import pandas as pd

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)
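The CSV itself is not shown in this answer. As a rough sketch for reproducing the example, the dataframe can be rebuilt from the values in the feature matrix printed at the end of this answer. The dates are an assumption inferred from the DAY, MONTH, YEAR and WEEKDAY columns there (1 to 12 January 2014), and the column names are the ones the code above refers to.

```python
import pandas as pd

# Rows reconstructed from the feature matrix printed at the end of this answer.
# The dates are inferred from its DAY/MONTH/YEAR columns, not read from the CSV.
rows = [
    (1, "2014-01-01", "Arsenal", "Chelsea", 2, 0, "1"),
    (2, "2014-01-02", "Arsenal", "Chelsea", 1, 0, "1"),
    (3, "2014-01-03", "Arsenal", "Chelsea", 0, 3, "2"),
    (4, "2014-01-04", "Chelsea", "Arsenal", 1, 1, "X"),
    (5, "2014-01-05", "Chelsea", "Arsenal", 2, 0, "1"),
    (6, "2014-01-06", "Chelsea", "Arsenal", 2, 1, "1"),
    (7, "2014-01-07", "Arsenal", "Chelsea", 2, 2, "X"),
    (8, "2014-01-08", "Arsenal", "Chelsea", 0, 1, "2"),
    (9, "2014-01-09", "Arsenal", "Chelsea", 1, 3, "2"),
    (10, "2014-01-10", "Chelsea", "Arsenal", 3, 1, "1"),
    (11, "2014-01-11", "Chelsea", "Arsenal", 2, 2, "X"),
    (12, "2014-01-12", "Chelsea", "Arsenal", 4, 1, "1"),
]
columns = ["match_id", "match_date", "home_team", "away_team",
           "home_goals", "away_goals", "label"]

matches_df = pd.DataFrame(rows, columns=columns)
# Parse the dates so they can serve as a proper time index.
matches_df["match_date"] = pd.to_datetime(matches_df["match_date"])
# Keep the rows in chronological order; the primitive below relies on that.
matches_df = matches_df.sort_values("match_date").reset_index(drop=True)
```

A dataframe built this way could be passed as dataframe=matches_df in place of the one read from the CSV.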
Then we define a custom transform primitive that calculates the average goals scored by a team over its last n games. It takes two parameters: n, the number of past games to average over, and which_team, which selects whether the average is computed for the home or the away team. Information on defining custom primitives is in our documentation here and here.
from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive


def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
    })

    result = []
    # iloc is positional, so track the row position rather than the index label
    for pos, (_, current_game) in enumerate(df.iterrows()):
        # get the right team for this game
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:pos]

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            # not enough history yet for this team
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals per game, counting only the goals this team scored
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()
        result.append(mean)

    return result


# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                             input_types=[Categorical, Categorical, Numeric, Numeric],
                                             return_type=Numeric,
                                             cls_attributes={"generate_name": make_name, "uses_full_entity": True})
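To sanity-check the logic independently of Featuretools, here is a minimal pure-Python sketch of the same calculation. It is not the primitive itself: the function name avg_goals_last_n and the tuple layout are made up for illustration, and it assumes the matches are already in chronological order. The six sample rows are the first six matches from the feature matrix printed at the end of this answer.

```python
from collections import defaultdict, deque


def avg_goals_last_n(matches, which_team, n):
    """Average goals scored by the chosen team in its previous n matches.

    matches: list of (home_team, away_team, home_goals, away_goals) tuples,
             assumed to be sorted chronologically.
    which_team: "home_team" or "away_team".
    Returns one value per match, or None when the team has fewer than n
    earlier matches.
    """
    # team -> goals scored in its most recent n matches, whatever the venue
    history = defaultdict(lambda: deque(maxlen=n))
    result = []
    for home, away, home_goals, away_goals in matches:
        team = home if which_team == "home_team" else away
        past = history[team]
        result.append(sum(past) / n if len(past) == n else None)
        # Record the current match only after computing its feature,
        # so a match never contributes to its own average.
        history[home].append(home_goals)
        history[away].append(away_goals)
    return result


matches = [
    ("Arsenal", "Chelsea", 2, 0),
    ("Arsenal", "Chelsea", 1, 0),
    ("Arsenal", "Chelsea", 0, 3),
    ("Chelsea", "Arsenal", 1, 1),
    ("Chelsea", "Arsenal", 2, 0),
    ("Chelsea", "Arsenal", 2, 1),
]

home_last1 = avg_goals_last_n(matches, "home_team", 1)
away_last3 = avg_goals_last_n(matches, "away_team", 3)
print(home_last1)  # [None, 2.0, 1.0, 3.0, 1.0, 2.0]
print(away_last3)
```

These values should line up with the home_team_goal_last_1 and away_team_goal_last_3 columns for matches 1 to 6 in the output below, with None in place of NaN.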
Now we can define features using this primitive. In this case, we will have to do it manually.
input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)
features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]
Next, we can calculate the feature matrix:
fm = ft.calculate_feature_matrix(entityset=es, features=features)
This returns:
home_team_goal_last_1 home_team_goal_last_3 home_team_goal_last_5 away_team_goal_last_1 away_team_goal_last_3 away_team_goal_last_5
match_id
1 NaN NaN NaN NaN NaN NaN
2 2.0 NaN NaN 0.0 NaN NaN
3 1.0 NaN NaN 0.0 NaN NaN
4 3.0 1.000000 NaN 0.0 1.000000 NaN
5 1.0 1.333333 NaN 1.0 0.666667 NaN
6 2.0 2.000000 1.2 0.0 0.333333 0.8
7 1.0 0.666667 0.6 2.0 1.666667 1.6
8 2.0 1.000000 0.8 2.0 2.000000 2.0
9 0.0 1.000000 0.8 1.0 1.666667 1.6
10 3.0 2.000000 2.0 1.0 1.000000 0.8
11 3.0 2.333333 2.2 1.0 0.666667 1.0
12 2.0 2.666667 2.2 2.0 1.333333 1.2
Finally, we can also use these manually defined features as an input to automated feature engineering with Deep Feature Synthesis, which is explained here. When the manually defined features are passed in as seed_features, ft.dfs will automatically stack on top of them.
fm, feature_defs = ft.dfs(entityset=es,
                          target_entity="matches",
                          seed_features=features,
                          agg_primitives=[],
                          trans_primitives=["day", "month", "year", "weekday", "percentile"])
feature_defs is:
[<Feature: home_team>,
<Feature: away_team>,
<Feature: home_goals>,
<Feature: away_goals>,
<Feature: label>,
<Feature: home_team_goal_last_1>,
<Feature: home_team_goal_last_3>,
<Feature: home_team_goal_last_5>,
<Feature: away_team_goal_last_1>,
<Feature: away_team_goal_last_3>,
<Feature: away_team_goal_last_5>,
<Feature: DAY(match_date)>,
<Feature: MONTH(match_date)>,
<Feature: YEAR(match_date)>,
<Feature: WEEKDAY(match_date)>,
<Feature: PERCENTILE(home_goals)>,
<Feature: PERCENTILE(away_goals)>,
<Feature: PERCENTILE(home_team_goal_last_1)>,
<Feature: PERCENTILE(home_team_goal_last_3)>,
<Feature: PERCENTILE(home_team_goal_last_5)>,
<Feature: PERCENTILE(away_team_goal_last_1)>,
<Feature: PERCENTILE(away_team_goal_last_3)>,
<Feature: PERCENTILE(away_team_goal_last_5)>]
The feature matrix is:
home_team away_team home_goals away_goals label home_team_goal_last_1 home_team_goal_last_3 home_team_goal_last_5 away_team_goal_last_1 away_team_goal_last_3 away_team_goal_last_5 DAY(match_date) MONTH(match_date) YEAR(match_date) WEEKDAY(match_date) PERCENTILE(home_goals) PERCENTILE(away_goals) PERCENTILE(home_team_goal_last_1) PERCENTILE(home_team_goal_last_3) PERCENTILE(home_team_goal_last_5) PERCENTILE(away_team_goal_last_1) PERCENTILE(away_team_goal_last_3) PERCENTILE(away_team_goal_last_5)
match_id
1 Arsenal Chelsea 2 0 1 NaN NaN NaN NaN NaN NaN 1 1 2014 2 0.666667 0.166667 NaN NaN NaN NaN NaN NaN
2 Arsenal Chelsea 1 0 1 2.0 NaN NaN 0.0 NaN NaN 2 1 2014 3 0.333333 0.166667 0.590909 NaN NaN 0.227273 NaN NaN
3 Arsenal Chelsea 0 3 2 1.0 NaN NaN 0.0 NaN NaN 3 1 2014 4 0.125000 0.958333 0.272727 NaN NaN 0.227273 NaN NaN
4 Chelsea Arsenal 1 1 X 3.0 1.000000 NaN 0.0 1.000000 NaN 4 1 2014 5 0.333333 0.500000 0.909091 0.333333 NaN 0.227273 0.500000 NaN
5 Chelsea Arsenal 2 0 1 1.0 1.333333 NaN 1.0 0.666667 NaN 5 1 2014 6 0.666667 0.166667 0.272727 0.555556 NaN 0.590909 0.277778 NaN
6 Chelsea Arsenal 2 1 1 2.0 2.000000 1.2 0.0 0.333333 0.8 6 1 2014 0 0.666667 0.500000 0.590909 0.722222 0.571429 0.227273 0.111111 0.214286
7 Arsenal Chelsea 2 2 X 1.0 0.666667 0.6 2.0 1.666667 1.6 7 1 2014 1 0.666667 0.791667 0.272727 0.111111 0.142857 0.909091 0.833333 0.785714
8 Arsenal Chelsea 0 1 2 2.0 1.000000 0.8 2.0 2.000000 2.0 8 1 2014 2 0.125000 0.500000 0.590909 0.333333 0.357143 0.909091 1.000000 1.000000
9 Arsenal Chelsea 1 3 2 0.0 1.000000 0.8 1.0 1.666667 1.6 9 1 2014 3 0.333333 0.958333 0.090909 0.333333 0.357143 0.590909 0.833333 0.785714
10 Chelsea Arsenal 3 1 1 3.0 2.000000 2.0 1.0 1.000000 0.8 10 1 2014 4 0.916667 0.500000 0.909091 0.722222 0.714286 0.590909 0.500000 0.214286
11 Chelsea Arsenal 2 2 X 3.0 2.333333 2.2 1.0 0.666667 1.0 11 1 2014 5 0.666667 0.791667 0.909091 0.888889 0.928571 0.590909 0.277778 0.428571
12 Chelsea Arsenal 4 1 1 2.0 2.666667 2.2 2.0 1.333333 1.2 12 1 2014 6 1.000000 0.500000 0.590909 1.000000 0.928571 0.909091 0.666667 0.571429
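One caveat before using this matrix as X, as the question intends: home_goals, away_goals, label and the PERCENTILE features built on the goal columns all describe the result of the match being predicted, so they would probably need to be dropped to avoid leaking the target. A minimal pandas sketch of that split follows; the helper name split_features_and_label and its parameters are hypothetical, and the two sample rows are matches 2 and 4 from the table above, restricted to a few columns.

```python
import pandas as pd


def split_features_and_label(fm, label_col="label",
                             outcome_cols=("home_goals", "away_goals")):
    """Return (X, y), dropping every column derived from the current result."""
    leaky = [
        col for col in fm.columns
        if col == label_col
        # matches the raw outcome column and wrappers such as PERCENTILE(home_goals)
        or any(col == out or "(%s)" % out in col for out in outcome_cols)
    ]
    X = fm.drop(columns=leaky)
    y = fm[label_col]
    return X, y


# Two rows (matches 2 and 4) taken from the feature matrix above.
sample_fm = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea"],
    "home_goals": [1, 1],
    "away_goals": [0, 1],
    "label": ["1", "X"],
    "home_team_goal_last_1": [2.0, 3.0],
    "PERCENTILE(home_goals)": [0.333333, 0.333333],
}, index=pd.Index([2, 4], name="match_id"))

X, y = split_features_and_label(sample_fm)
print(list(X.columns))  # ['home_team', 'home_team_goal_last_1']
```

The rolling averages are kept because they are computed only from earlier matches. Rows where they are NaN (teams with too little history) still need to be imputed or dropped, depending on the model.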
Source: https://stackoverflow.com/questions/53579465/how-to-use-featuretools-to-create-features-from-multiple-columns-in-single-dataf