Question
I have two data frames of different sizes:
df1
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
and
df2
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510753 1669 6 4 0 ... 0 NaN 33.400 73.200
1 1720.535519 1720 7 15 0 ... 0 NaN 28.700 77.200
2 1780.000000 1780 0 0 0 ... 0 NaN 35.000 77.000
3 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
4 1803.665753 1803 9 1 0 ... 0 NaN 30.300 78.800
5 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
1. I want to compare df1 and df2 based on the column 'YearDeci' and find the common entries and the unique entries (rows other than the common rows).
2. Output the rows of df1 that are common with df2, based on the column 'YearDeci'.
3. Output the rows of df1 that are unique with respect to df2, based on the column 'YearDeci'.
*NB: A difference of up to +/-0.0001 in the 'YearDeci' values is tolerable.
The expected output looks like this:
row_common=
YearDeci Year Month Day ... Magnitude Lat Lon
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
row_unique=
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
Answer 1:
First, compare df1.YearDeci with df2.YearDeci on the "each with each" principle, i.e. every value in df1 against every value in df2. To perform the comparison, use the np.isclose function with the required absolute tolerance.
The result is a two-dimensional boolean array:
- first index - index in df1,
- second index - index in df2.
Then, using np.argwhere, find the indices of the True values, i.e. the indices of "correlated" rows from df1 and df2, and create a DataFrame from them.
The code to perform the above operations is:
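import numpy as np
import pandas as pd
# .to_numpy() is needed because recent pandas rejects Series[:, np.newaxis]
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci.to_numpy()[:, np.newaxis],
    df2.YearDeci.to_numpy()[np.newaxis, :], atol=0.0001, rtol=0)),
    columns=['ind1', 'ind2'])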
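The one-liner above packs three steps together. A minimal sketch that unpacks them, using only the YearDeci values shown in the question (the intermediate names a, b, close and pairs are introduced here for illustration and are not part of the original answer):

```python
import numpy as np
import pandas as pd

# YearDeci values copied from the question's df1 and df2.
df1 = pd.DataFrame({"YearDeci": [1551.997260, 1661.997260, 1720.535519,
                                 1734.997260, 1777.997260]})
df2 = pd.DataFrame({"YearDeci": [1669.510753, 1720.535519, 1780.000000,
                                 1803.388014, 1803.665753, 1803.388014]})

# Step 1: reshape to a column (5, 1) and a row (1, 6) so that broadcasting
# compares every df1 value with every df2 value.
a = df1["YearDeci"].to_numpy()[:, np.newaxis]
b = df2["YearDeci"].to_numpy()[np.newaxis, :]

# Step 2: boolean matrix, True where |a - b| <= atol
# (rtol=0 switches off the relative part of the tolerance).
close = np.isclose(a, b, atol=0.0001, rtol=0)
print(close.shape)   # (5, 6)

# Step 3: (row, column) positions of the True cells.
pairs = np.argwhere(close)
print(pairs)         # [[2 1]]

ind = pd.DataFrame(pairs, columns=["ind1", "ind2"])
```

The single pair (2, 1) says that row 2 of df1 matches row 1 of df2, which is the 1720.535519 event.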
Then, having pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:
result = ind.merge(df1, left_on='ind1', right_index=True)\
.merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])
The final step is to drop both "auxiliary index columns" (ind1 and ind2):
result.drop(columns=['ind1', 'ind2'], inplace=True)
The result (divided into 2 parts) is:
YearDeci_1 Year_1 Month_1 Day_1 Magnitude Lat_1 Lon_1 YearDeci_2 \
0 1720.535519 1720 7 15 6.5 28.37 77.09 1720.535519
Year_2 Month_2 Day_2 Hour Seconds Mb Lat_2 Lon_2
0 1720 7 15 0 0 NaN 28.7 77.2
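For reference, the whole procedure can be run end to end on cut-down frames. This is only a sketch: the frames below keep a few rows and columns from the question, and the reset_index calls are an added assumption, included because np.argwhere returns positions while right_index=True matches on index labels, so the two only agree when the index is 0, 1, 2, ...

```python
import numpy as np
import pandas as pd

# Reduced versions of df1 and df2 (only a few rows and columns kept).
df1 = pd.DataFrame({"YearDeci": [1551.997260, 1720.535519, 1777.997260],
                    "Magnitude": [7.5, 6.5, 7.7],
                    "Lat": [34.00, 28.37, 34.00]})
df2 = pd.DataFrame({"YearDeci": [1669.510753, 1720.535519, 1780.000000],
                    "Mb": [np.nan, np.nan, np.nan],
                    "Lat": [33.4, 28.7, 35.0]})

# Make sure index labels equal row positions before merging on them.
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Pairwise comparison with an absolute tolerance of 0.0001.
close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)
ind = pd.DataFrame(np.argwhere(close), columns=["ind1", "ind2"])

# Attach the matching rows from both frames, then drop the helper columns.
result = (ind.merge(df1, left_on="ind1", right_index=True)
             .merge(df2, left_on="ind2", right_index=True,
                    suffixes=("_1", "_2"))
             .drop(columns=["ind1", "ind2"]))

print(result.columns.tolist())
# ['YearDeci_1', 'Magnitude', 'Lat_1', 'YearDeci_2', 'Mb', 'Lat_2']
```

Only columns present in both frames (here YearDeci and Lat) receive the _1 and _2 suffixes; Magnitude and Mb keep their names.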
Answer 2:
The positions of the common rows in df1 are already stored in the ind1 column of the variable ind.
So, to find the unique entries, all we need to do is drop the common rows from df1 according to the indices in ind. It is better to write the common entries to a separate CSV file and read that file into a variable:
df1_common = pd.read_csv("df1_common.csv")
df1_uniq = df1.drop(df1.index[ind.ind1])
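As an alternative sketch that avoids the intermediate CSV file, the common and unique rows of df1 can be selected directly with a boolean mask built from the same np.isclose comparison. This is not the method given in the answers; the name is_common is hypothetical, while row_common and row_unique follow the naming in the question. Reducing the matrix with any(axis=1) also means a df1 row that matches several df2 rows is listed only once.

```python
import numpy as np
import pandas as pd

# YearDeci and Magnitude values copied from the question.
df1 = pd.DataFrame({"YearDeci": [1551.997260, 1661.997260, 1720.535519,
                                 1734.997260, 1777.997260],
                    "Magnitude": [7.5, 7.5, 6.5, 7.5, 7.7]})
df2 = pd.DataFrame({"YearDeci": [1669.510753, 1720.535519, 1780.000000,
                                 1803.388014, 1803.665753, 1803.388014]})

# Pairwise comparison: rows correspond to df1, columns to df2.
close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)

# A df1 row is "common" if it is close to at least one df2 row.
is_common = close.any(axis=1)

row_common = df1[is_common]
row_unique = df1[~is_common]

print(row_common.index.tolist())  # [2]
print(row_unique.index.tolist())  # [0, 1, 3, 4]
```

Because the mask is positional, this works regardless of what index labels df1 carries.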
Source: https://stackoverflow.com/questions/58615048/how-to-compare-two-data-frames-of-different-size-based-on-a-column