Question
My goal is to collapse the table below into a single row; this question deals specifically with the blue row. The table has three categorical variables and 6 analysis/quantitative variables. Columns C1 and C2 are the only variables that need to match for a successful join. All blank cells are NaN, and Python code for copying is below. These rows are exported independently because they have information found in other related tables, not included in the export.
Question. (Blue) Some of the quantitative information in the blue row is also in the grey row and some is not. Is there a way to copy the new information (-8 in Q6) into the grey row and then delete/highlight the blue row? Here, the grey row categorical information is maintained, assuming the keep='first' default of drop_duplicates is active.
Related Question. (Yellow) How can I delete rows that are not exact duplicates but contain no new information (only more NaN values)?
Data table
Expected Output
The expected output would have the grey row updated with Q6 from the blue row and the blue row removed.
[['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,-8]]
Current Progress
My current code includes this line to drop all rows where all quantitative variables are NaN:
df.dropna(subset=df.columns[4:], how='all', inplace=True)
It also includes this line to delete all rows where all quantitative variables are the same:
df.drop_duplicates(subset=df.columns[4:], inplace=True)
Example code that can be copied into an IDE.
import pandas as pd
from numpy import nan
from pandas import Timestamp

# Raw rows; blank cells in the table are NaN
data = [['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,nan],
['S1','P3','H1',Timestamp('2004-12-20 00:00:00'),nan,nan,nan,nan,nan,nan],
['S1','P3','H2',Timestamp('2004-12-20 00:00:00'),-15.0,nan,nan,-10.0,nan,nan],
['S1','P3','H3',Timestamp('2004-12-07 00:00:00'),nan,nan,nan,nan,-15.0,-8.0],
['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,nan]]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
df = pd.DataFrame(data=data, columns=cols)

# Drop exact duplicate rows
df.drop_duplicates(inplace=True)
# Drop rows where every quantitative column (Q1 to Q6) is NaN
df.dropna(subset=df.columns[4:], how='all', inplace=True)
# Drop rows whose quantitative values repeat an earlier row
df.drop_duplicates(subset=df.columns[4:], inplace=True)
Answer 1:
Split off the categorical columns:
df_categorical = df[['C1 (PK)', 'C2 (FK)',"C3", "C4"]]
Group by the first two columns and keep the first element of each group:
df_categorical = df_categorical.groupby(["C1 (PK)", "C2 (FK)"]).first()
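For the quantitative columns, use groupby again, this time with mean. Select only Q1 to Q6 first, because pandas 2.0 and later raise a TypeError when asked to average the string column C3:
df_quantitative = df.groupby(['C1 (PK)', 'C2 (FK)'])[list(df.columns[4:])].mean()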
Concatenate the two DataFrames to get the result:
df_final = pd.concat([df_quantitative, df_categorical], axis=1)
Reset the index:
df_final.reset_index(inplace=True)
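Put together, the steps above might look like the following self-contained sketch on the question's sample data. The names data, keys, quant_cols and df_first are illustrative and do not come from the original post. Note that mean only reproduces the "copy the new information" intent when overlapping values agree, as they do here; where they differ it averages them. The last line shows an alternative that is not part of the answer: GroupBy.first skips NaN by default, so it picks the first non-null value per column without averaging.

```python
import numpy as np
import pandas as pd

nan = np.nan
ts = pd.Timestamp

cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
data = [
    ['S1', 'P3', 'H1', ts('2004-12-04'), -15.0, -27.4, nan, -10.0, -15.0, nan],
    ['S1', 'P3', 'H1', ts('2004-12-20'), nan, nan, nan, nan, nan, nan],
    ['S1', 'P3', 'H2', ts('2004-12-20'), -15.0, nan, nan, -10.0, nan, nan],
    ['S1', 'P3', 'H3', ts('2004-12-07'), nan, nan, nan, nan, -15.0, -8.0],
    ['S1', 'P3', 'H1', ts('2004-12-04'), -15.0, -27.4, nan, -10.0, -15.0, nan],
]
df = pd.DataFrame(data, columns=cols)

keys = ['C1 (PK)', 'C2 (FK)']   # columns that must match for rows to be merged
quant_cols = cols[4:]           # Q1 to Q6

# Categorical part: keep the first C3/C4 value seen for each key pair.
df_categorical = df.groupby(keys)[['C3', 'C4']].first()

# Quantitative part: mean per key pair; NaN values are skipped by default.
df_quantitative = df.groupby(keys)[quant_cols].mean()

# Both frames share the same (C1, C2) index, so they align column-wise.
df_final = pd.concat([df_categorical, df_quantitative], axis=1).reset_index()

# Alternative (an assumption, not the answer's method): first non-null
# value per column within each group, with no averaging involved.
df_first = df.groupby(keys, as_index=False).first()

print(df_final.to_string(index=False))
```

On this data both df_final and df_first contain a single row for ('S1', 'P3') that keeps H1 and 2004-12-04 from the first row and picks up Q6 = -8.0 from the H3 row. Because every row shares the same key pair, the rows that add no new information are absorbed by the same grouping.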
Source: https://stackoverflow.com/questions/59774087/how-to-join-two-rows-that-have-the-same-keys-and-complementary-values