Question
I have two CSV files. Depending on the value of a cell in CSV file 1, I need to search for that value in a column range of CSV file 2 and get the corresponding value from another column in CSV file 2. I am sorry if this is confusing; the illustration below should make it clearer.
CSV file 1
Car Mileage
A 8
B 6
C 10
CSV file 2
Score Mileage(Min) Mileage(Max)
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
And my desired output CSV file is something like this
Car Mileage Score
A 8 3
B 6 2
C 10 4
Car A is given a score of 3 because its mileage, 8, falls in the range 7 to 9 in CSV file 2, and the score for that range is 3. Any help will be appreciated. Thanks in advance.
Answer 1:
As of writing this, the current stable release of pandas is v0.21.
To read your files, use pd.read_csv:
import pandas as pd
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
df0
Car Mileage
0 A 8
1 B 6
2 C 10
df1
Score Mileage(Min) Mileage(Max)
0 1 1 3
1 2 4 6
2 3 7 9
3 4 10 12
4 5 13 15
To find the Score, use pd.IntervalIndex by calling IntervalIndex.from_tuples. This should be really fast:
v = df1.loc[:, 'Mileage(Min)':'Mileage(Max)'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, closed='both') # you can also use `from_arrays`
df0['Score'] = df1['Score'].iloc[idx.get_indexer(df0.Mileage.values)].values  # .iloc takes positions only, so select the column first
df0
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Other methods of creating an IntervalIndex are outlined in the pandas documentation.
To write your result, use pd.DataFrame.to_csv:
df0.to_csv('file3.csv')
Here's a high-level outline of what I've done here.
- First, read in your CSV files
- Use pd.IntervalIndex to build an interval index tree. So, searching is now logarithmic in complexity.
- Use idx.get_indexer to find the index of each value in the tree
- Use the index to locate the Score value in df1, and assign this back to df0. Note that I call .values, otherwise, the values will be misaligned when assigning back.
- Write your result back to CSV
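The steps above can be put together in one short sketch. It is only an illustration under a few assumptions: the files are comma-separated, in-memory strings stand in for file1.csv and file2.csv, and a hypothetical car D with mileage 20 is added to show what happens when a value falls outside every range (get_indexer reports -1 for it).

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-ins for file1.csv and file2.csv (assumed comma-separated).
# Car D is a hypothetical extra row whose mileage matches no range.
file1 = io.StringIO("Car,Mileage\nA,8\nB,6\nC,10\nD,20\n")
file2 = io.StringIO(
    "Score,Mileage(Min),Mileage(Max)\n1,1,3\n2,4,6\n3,7,9\n4,10,12\n5,13,15\n"
)

df0 = pd.read_csv(file1)
df1 = pd.read_csv(file2)

# One closed interval [min, max] per row of the score table.
idx = pd.IntervalIndex.from_arrays(
    df1['Mileage(Min)'], df1['Mileage(Max)'], closed='both'
)

# Position of the matching interval for each mileage, or -1 if none matches.
pos = idx.get_indexer(df0['Mileage'].to_numpy())

# Map positions to scores; unmatched rows become NaN instead of silently
# picking up the last row (which is what indexing with -1 would do).
scores = df1['Score'].to_numpy()
df0['Score'] = np.where(pos >= 0, scores[pos], np.nan)

# Write the result without the row index, here to a string buffer.
out = io.StringIO()
df0.to_csv(out, index=False)
print(out.getvalue())
```

Because of the NaN, the Score column becomes floating point here; if every mileage is guaranteed to fall inside some range, the masking step can be dropped and the column stays integer.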
For more information on IntervalIndex, take a look at this SO Q/A: Finding matching interval(s) in pandas Intervalindex
Note that IntervalIndex is new in v0.20, so if you have an older version, make sure you update your version with
pip install --upgrade pandas
Answer 2:
You can use IntervalIndex, new in version 0.20.0+:
First create DataFrames by read_csv:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
Create an IntervalIndex with from_arrays:
s = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], 'both')
print (s)
IntervalIndex([[1, 3], [4, 6], [7, 9], [10, 12], [13, 15]]
closed='both',
dtype='interval[int64]')
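Look up each Mileage value through the IntervalIndex and assign the result to a new column as a plain array (via .values). Without .values the result is still indexed by intervals, which does not align with the index of df1, and you get:
TypeError: incompatible index of inserted column with frame index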
df1['Score'] = df2.set_index(s).loc[df1['Mileage'], 'Score'].values
print (df1)
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Finally, write the result to a file with to_csv:
df1.to_csv('file3.csv', index=False)
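To see why the .values step matters, here is a small sketch that builds the two frames inline (instead of reading the CSV files, which is an assumption made so the snippet is self-contained) and inspects the index of the lookup result before assigning it.

```python
import pandas as pd

# Inline copies of the two tables from the question.
df1 = pd.DataFrame({'Car': ['A', 'B', 'C'], 'Mileage': [8, 6, 10]})
df2 = pd.DataFrame({
    'Score': [1, 2, 3, 4, 5],
    'Mileage(Min)': [1, 4, 7, 10, 13],
    'Mileage(Max)': [3, 6, 9, 12, 15],
})

s = pd.IntervalIndex.from_arrays(
    df2['Mileage(Min)'], df2['Mileage(Max)'], closed='both'
)

# Look up the interval containing each mileage value.
looked_up = df2.set_index(s).loc[df1['Mileage'].tolist(), 'Score']

# The result is labeled by the matched intervals, not by 0..2,
# so it cannot be aligned with df1 directly.
print(looked_up.index)

# Dropping the index and assigning positionally avoids the alignment problem.
df1['Score'] = looked_up.to_numpy()
print(df1)
```

to_numpy() is used here in place of .values; both hand back the underlying array, so the assignment is purely positional.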
Answer 3:
Setup
import pandas as pd
data = [(1,1,3), (2,4,6), (3,7,9), (4,10,12), (5,13,15)]
df = pd.DataFrame(data, columns=['Score','MMin','MMax'])
car_data = [('A', 8), ('B', 6), ('C', 10)]
car = pd.DataFrame(car_data, columns=['Car','Mileage'])
def find_score(x, df):
    # Return the Score of the range containing x, or -99 if no range matches
    result = -99
    for idx, row in df.iterrows():
        if row.MMin <= x <= row.MMax:
            result = row.Score
    return result
car['Score'] = car.Mileage.apply(lambda x: find_score(x, df))
Which yields
In [58]: car
Out[58]:
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
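The row loop above is fine for small tables. As a possible alternative, the same range test can be done for all cars at once with NumPy broadcasting. This is only a sketch: it assumes the ranges do not overlap (so at most one range matches each mileage), keeps the -99 sentinel from find_score, and adds a hypothetical car D with mileage 20 to exercise the no-match case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [(1, 1, 3), (2, 4, 6), (3, 7, 9), (4, 10, 12), (5, 13, 15)],
    columns=['Score', 'MMin', 'MMax'],
)
# Car D is a hypothetical extra row that falls outside every range.
car = pd.DataFrame(
    [('A', 8), ('B', 6), ('C', 10), ('D', 20)], columns=['Car', 'Mileage']
)

mileage = car['Mileage'].to_numpy()[:, None]   # shape (n_cars, 1)
lo = df['MMin'].to_numpy()                     # shape (n_ranges,)
hi = df['MMax'].to_numpy()

# inside[i, j] is True when car i's mileage lies in range j (inclusive).
inside = (mileage >= lo) & (mileage <= hi)

# Column of the first matching range per car (0 when nothing matches,
# which is why the any() mask is needed below).
first = inside.argmax(axis=1)

car['Score'] = np.where(inside.any(axis=1), df['Score'].to_numpy()[first], -99)
print(car)
```

The comparison builds an n_cars by n_ranges boolean array, so memory grows with the product of the two table sizes; for very large inputs the IntervalIndex approach from the other answers is likely the better fit.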
Source: https://stackoverflow.com/questions/47941113/searching-a-particular-value-in-a-range-among-two-columns-python-dataframe