Question
I have two python pandas dataframes. One contains all NFL quarterbacks' college football statistics since 2007 and a label for the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football QBs' data from this season along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football QB based on their labels. I'd like to add the two comparable QBs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player    Year  Team  GP  Comp %  YDS   TD  INT  Label
Player A  2020  ASU   12  65.5    3053  25  6    Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" from the first dataframe. The way I thought of doing this was to use SciPy's KDTree and run a query:
tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)
closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?
Answer 1:
.iterrows() returns pairs of (index, Series), where index is obviously the index of the row, and the Series holds the feature values, with the column names as its index (see below).
As you have it, row is being stored as that whole tuple, so row[features] won't really do anything. What you're really after is the Series that holds the features and values, i.e. row[1]. So you can either index into that directly, or just break them up in your loop by doing for idx, row in df.iterrows():. Then you can just work with that Series row.
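As a quick, made-up illustration of that (just a throwaway dataframe, not your data), you can see the tuple vs. Series distinction directly:
import pandas as pd
# tiny hypothetical dataframe just to show what .iterrows() yields
df = pd.DataFrame({'GP': [12], 'YDS': [3053]}, index=['Player A'])
for item in df.iterrows():
    print(type(item))   # <class 'tuple'> -- an (index, Series) pair
    idx, row = item     # unpack into the row label and the Series
    print(idx)          # Player A
    print(row['YDS'])   # 3053 -- index the Series by column name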
Scikit-learn is a good package to use here (it's actually built on SciPy, so you'll notice the same syntax). You'll have to edit the code to your specifications (e.g. filter down to only the "Average" players; if you are one-hot encoding the category columns, you may need to add those to the features; etc.), but to give you an idea, you can see below how to build the KDTree and then take each row in the college dataframe to find which 2 rows it's closest to in the nfl dataframe. (I made these dataframes up just for the example: the nfl one is roughly accurate, but the college one is completely made up.) I have it print out the names, but as you can see with print(closest), the raw index arrays are there for you as well.
import pandas as pd
nfl = pd.DataFrame([['Tom Brady','1999','Michigan',11,61.0,2217,16,6,'Average'],
['Aaron Rodgers','2004','California',12,66.1,2566,24,8,'Average'],
['Peyton Manning','1997','Tennessee',12,60.2,3819,36,11,'Average'],
['Drew Brees','2000','Purdue',12,60.4,3668,26,12,'Average'],
['Dan Marino','1982','Pitt',12,58.5,2432,17,23,'Average'],
['Joe Montana','1978','Notre Dame',11,54.2,2010,10,9,'Average']],
columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])
college = pd.DataFrame([['Joe Smith','2019','Illinois',11,55.6,1045,15,7,'Average'],
['Mike Thomas','2019','Wisconsin',11,67,2045,19,11,'Average'],
['Steve Johnson','2019','Nebraska',12,57.3,2345,9,19,'Average']],
columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])
features = ['GP','Comp %','YDS','TD','INT']
from sklearn.neighbors import KDTree
tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0]+1)
closest = []
for idx, row in college.iterrows():
    # row[features] is a Series; reshape its values into a 2-D array for tree.query
    X = row[features].values.reshape(1, -1)
    # ndx holds the positional indices of the 2 closest nfl rows
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx, 'Player']
    closestPlayers = [nfl.loc[x, 'Player'] for x in ndx[0]]
    print('%s closest to: %s' % (collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
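To tie it back to what you asked (only comparing against NFL QBs with the same label, and adding the two comps as new columns), here is one rough sketch along those lines. It assumes the nfl, college and features objects from the code above, and the comp_1 / comp_2 column names are just placeholders I picked:
from sklearn.neighbors import KDTree
import pandas as pd

def two_closest(row):
    # keep only the NFL QBs whose label matches this college QB's label
    pool = nfl[nfl['Label'] == row['Label']]
    tree = KDTree(pool[features])
    # query the 2 nearest neighbours within that label group
    _, ndx = tree.query(row[features].values.astype(float).reshape(1, -1), k=2)
    names = pool.iloc[ndx[0]]['Player'].tolist()
    return pd.Series(names, index=['comp_1', 'comp_2'])

# add the two comparable QBs as two new columns on the college dataframe
college[['comp_1', 'comp_2']] = college.apply(two_closest, axis=1)
print(college[['Player', 'comp_1', 'comp_2']])
Rebuilding the tree inside the function is wasteful if you have many rows; if that matters, you could build one KDTree per label group up front (e.g. in a dict keyed by label) and just look it up for each row instead.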
Source: https://stackoverflow.com/questions/59401639/use-kdtree-knn-return-closest-neighbors