Question
My apologies for the length of this but I want to explain as fully as possible. I am completely stumped on how to solve this.
The Setup:
I have two dataframes. The first, which I'll call df_01, has a list of all possible values in its first column, and there are no duplicate values in this column. These are all the values that can appear on any of the lists. Each additional column represents an independent list and contains a number for how many days a given value has been on that list. df_01 has a shape of (9277, 32). These dimensions can change but will generally stay the same. The following is a small example of what it looks like.
df_01 before any actions:
index  values  list01  list02  ...  list30  list31
0      aaa     5       1       ...  NaN     83
1      bbb     NaN     NaN     ...  NaN     4
2      ccc     20      NaN     ...  NaN     32
3      ddd     1       27      ...  NaN     NaN
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
9274   qqq     NaN     15      ...  NaN     6
9275   rrr     238     NaN     ...  NaN     102
9276   sss     3       2       ...  NaN     NaN
9277   ttt     12      NaN     ...  NaN     99
This first dataframe (df_01) will always be the values as they were the previous day.
The second dataframe, which I'll call df_02, will always have fewer rows than df_01, and its length changes from day to day, but it always has the same number of columns as df_01. It currently has a shape of (1351, 32). In df_02 the first column has all the values that appear on the lists as of today, with no duplicates. The other columns have a 1 if the value is on that list today and NaN if it is not. Here's an example.
df_02 before any actions:
index  values  list01  list02  ...  list30  list31
0      aaa     1       1       ...  NaN     1
1      bbb     NaN     1       ...  1       1
2      ddd     1       1       ...  NaN     NaN
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
1349   qqq     NaN     NaN     ...  1       1
1350   rrr     1       NaN     ...  1       NaN
1351   sss     NaN     1       ...  NaN     1
The Question:
What I want to accomplish is as follows.
1) For every value in each column, if the value exists in the first dataframe (df_01) and not in the second (df_02), its counter in df_01 resets to NaN on a per-column basis.
2) Then, for every value in each column of the second dataframe (df_02), if the value exists in the same column of both dataframes, sum the values.
3) For example, if aaa on list01 of df_01 = 5 and aaa on list01 of df_02 = 1, then aaa on list01 of df_02 becomes 6. This keeps a running count.
4) If the value is NaN in both no action is needed.
5) If a value is NaN on (df_01) and 1 on (df_02) it stays 1.
*** The value in df_02 will always be a 1 or a NaN prior to the summation. It is a binary choice: the value in the values column either is on the individual list that day or is not.
*** Notice that the values ccc and ttt are in df_01, as they are possible values, but not in df_02, as they were on none of the lists today.
*** The asterisks around values like *NaN* or *6* denote the values that will change; they would not actually be in the data.
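To make the rules concrete, here is a minimal sketch of them applied to a single cell, that is, one value on one list. The function name update_cell is made up for illustration, and the sketch assumes counts are floats with NaN meaning "not on the list".

```python
import math

def update_cell(prev, today):
    """Apply the rules to one cell (hypothetical helper).

    prev  -- yesterday's day count from df_01, or NaN
    today -- 1.0 if the value is on the list today (df_02), else NaN
    Returns (new df_01 cell, new df_02 cell).
    """
    on_today = not math.isnan(today)
    had_count = not math.isnan(prev)
    if not on_today:
        # Rules 1 and 4: not on the list today, so the counter is NaN in both.
        return math.nan, math.nan
    if had_count:
        # Rules 2 and 3: on the list yesterday and today, keep a running count.
        return prev, prev + today
    # Rule 5: new on the list today, so df_02 keeps its 1.
    return prev, today
```

With the aaa/list01 cell from the tables, update_cell(5.0, 1.0) gives 5.0 for df_01 and 6.0 for df_02.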
The dataframes should look like this after the procedure:
df_01
index  values  list01  list02  ...  list30  list31
0      aaa     5       1       ...  NaN     83
1      bbb     NaN     NaN     ...  NaN     4
2      ccc     NaN     NaN     ...  NaN     NaN
3      ddd     1       27      ...  NaN     NaN
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
9274   qqq     NaN     *NaN*   ...  NaN     6
9275   rrr     238     NaN     ...  NaN     *NaN*
9276   sss     *NaN*   2       ...  NaN     24
9277   ttt     12      NaN     ...  NaN     99
df_02
index  values  list01  list02  ...  list30  list31
0      aaa     *6*     *2*     ...  NaN     *84*
1      bbb     NaN     1       ...  1       *5*
2      ddd     *2*     *28*    ...  NaN     NaN
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
.      .       .       .       ...  .       .
1349   qqq     NaN     NaN     ...  1       *7*
1350   rrr     *239*   NaN     ...  1       NaN
1351   sss     NaN     *3*     ...  NaN     *25*
How would I go about accomplishing something like this? I don't even know where to begin. Any ideas, even if not completely working, just to point me in the right direction would be appreciated. Please let me know if anything needs clarification.
Thanks
Answer 1:
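import numpy as np

df1 = df1.set_index('values')
df2 = df2.set_index('values')
cols = [*df1.columns]
for col in cols:
    # Update df1: reset the counter where df2 has NaN for this value.
    # update() skips NaN, so those cells are marked with '-' first.
    col1 = df1[col].astype(object)
    col1.update(df2.loc[df2[col].isnull(), col].fillna('-'))
    df1[col] = col1.replace('-', np.nan).astype(float)
    # Update df2: sum where both dataframes have a number.
    col2 = df2[col].copy()
    col2.update(df2.loc[~df2[col].isnull(), col] + df1.loc[~df1[col].isnull(), col])
    df2[col] = col2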
This should do what you want. We loop over each column and update it individually. Make sure the cols list contains the correct columns for your dataframes.
The reason for the .fillna('-') in the update to df1 is that update() ignores NaN values in the series passed to it, so a NaN cannot overwrite an existing value. We fill with a placeholder instead, then replace the placeholder with NaN afterwards.
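As an alternative to the loop, the same result can probably be reached without a placeholder by aligning the two frames and masking. This is only a sketch on toy frames built from a few cells of the question's tables (three lists, four values); the names on_today, df1_new and df2_new are made up here. It assumes both frames have the same list columns and that NaN always means "not on the list".

```python
import numpy as np
import pandas as pd

# Toy frames using a few cells from the question's tables (three lists only).
df1 = pd.DataFrame({
    "values": ["aaa", "bbb", "ccc", "ddd"],
    "list01": [5, np.nan, 20, 1],
    "list02": [1, np.nan, np.nan, 27],
    "list31": [83, 4, 32, np.nan],
}).set_index("values")
df2 = pd.DataFrame({
    "values": ["aaa", "bbb", "ddd"],
    "list01": [1, np.nan, 1],
    "list02": [1, 1, 1],
    "list31": [1, 1, np.nan],
}).set_index("values")

# True where a value is on a list today; rows missing from df2 count as absent.
on_today = df2.reindex(df1.index).notna()

# Rule 1: reset yesterday's counter wherever the value is not on the list today.
df1_new = df1.where(on_today)

# Rules 2-5: add yesterday's count where there is one, otherwise keep the 1 or NaN.
df2_new = df2 + df1.reindex(df2.index).fillna(0)
```

One difference from the loop: update() only touches rows that exist in df2, so the loop leaves a value like ccc alone, while this sketch resets it because it is missing from df2 entirely. That is how I read rule 1, but drop the reindex on df1's side if you want the loop's behavior.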
Source: https://stackoverflow.com/questions/60442425/comparing-values-in-two-pandas-dataframes-to-keep-a-running-count