Filling missing middle values in pandas dataframe

我与影子孤独终老i 提交于 2021-02-19 05:25:06

问题


I have a pandas dataframe df as

Date  cost    NC
20     5   NaN
21     7   NaN
23     9  78.0
25     6  80.0

Now what I need to do is fillup the missing dates and hence fill the column with a value say x only if there is number in the previous row. That is I want the output like

Date  cost    NC
20     5   NaN
21     7   NaN
22     x   NaN
23     9  78.0
24     x   x
25     6  80.0

See Date 22 was missing and on 21 NC was missing, So on 22 cost is assigned to x but NC is assigned to NaN. Now setting the Date column to index and reindexing it to missing values I can get upto here

Date  cost    NC 
20     5.0   NaN
21     7.0   NaN
22     NaN   NaN
23     9.0  78.0
24     NaN   NaN
25     6.0  80.0

But I cant get to the final output. If you think this way it is like ffill() but instead of filling from previous row you have to put x here.

I have another problem. here I have a dataframe df like this

Date type  cost
10    a    30
11    a    30
11    b    25
13    a    27

Here also I have to fill the missing value and make it like this

Date type cost
10    a   30
11    a   30
11    b   25
12    a   30
12    b   25
13    a   27

as you can see there was 2 data row for date 11 so both are copied to 12. I wrote this program for the problem

missing=[12]
for i in missing:
    new_date=i
    i-=1        #go to previous date
    k=df[df["Date"] == i].index.tolist()[-1]+1 #index where to be filled
    data=pd.DataFrame(df[df["Date"] == i].values,columns=df.columns)
    data["Date"]=new_date
    df=pd.concat([df.iloc[:k],data,df.iloc[k:]]).reset_index(drop=True)

Now for a large data set the above program takes a lot of time as has to find index and concat 3 dataframe each time. Is there any better and efficient way to solve this problem?


回答1:


I don't think there is a way to pad just the "middle" values, but here's a way to do it (using ffill, bfill and fillna):

In [11]: df1  # assuming Date is the index via df.set_index("Date")
Out[11]:
      cost    NC
Date
20       5   NaN
21       7   NaN
23       9  78.0
25       6  80.0

In [12]: df2 = df1.reindex(np.arange(20,27))
         # 26 is sufficient, but let's see it working!

In [13]: df2
Out[13]:
      cost    NC
Date
20     5.0   NaN
21     7.0   NaN
22     NaN   NaN
23     9.0  78.0
24     NaN   NaN
25     6.0  80.0
26     NaN   NaN

You don't want to fill in the "outside" NaNs, which can be got with:

In [14]: df2.bfill().notnull() & df2.ffill().notnull()
Out[14]:
       cost     NC
Date
20     True  False
21     True  False
22     True  False
23     True   True
24     True   True
25     True   True
26    False  False

Now, we can update these (if they would be updated with a fillna):

In [15]: df2[df2.bfill().notnull() & df2.ffill().notnull()] = df2.fillna(0) # x = 0

In [16]: df2
Out[15]:
      cost    NC
Date
20     5.0   NaN
21     7.0   NaN
22     0.0   NaN
23     9.0  78.0
24     0.0   0.0
25     6.0  80.0
26     NaN   NaN

To (partially) answer the second question, IMO you're always better off in that situation to start with a pivot (this will give you a much better starting point):

In [21]: df
Out[21]:
   Date type  cost
0    10    a    30
1    11    a    30
2    11    b    25
3    13    a    27

In [22]: df.pivot_table("cost", "Date", "type")
Out[22]:
type     a     b
Date
10    30.0   NaN
11    30.0  25.0
13    27.0   NaN

Perhaps you are looking to fill forward from there? (and unstack if necessary).



来源:https://stackoverflow.com/questions/37821653/filling-missing-middle-values-in-pandas-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!