Question
I have a pandas DataFrame df as follows:
Date cost NC
20 5 NaN
21 7 NaN
23 9 78.0
25 6 80.0
What I need to do is fill in the missing dates and then fill each column with a value, say x, but only if there is a number in the previous row. That is, I want output like this:
Date cost NC
20 5 NaN
21 7 NaN
22 x NaN
23 9 78.0
24 x x
25 6 80.0
Date 22 was missing, and on 21 NC was missing, so on 22 cost is assigned x but NC is assigned NaN. By setting the Date column as the index and reindexing it to include the missing dates, I can get this far:
Date cost NC
20 5.0 NaN
21 7.0 NaN
22 NaN NaN
23 9.0 78.0
24 NaN NaN
25 6.0 80.0
But I can't get to the final output. In a way it is like ffill(), but instead of filling from the previous row you have to put x there.
I have another problem. Here I have a DataFrame df like this:
Date type cost
10 a 30
11 a 30
11 b 25
13 a 27
Here too I have to fill in the missing date, so that it looks like this:
Date type cost
10 a 30
11 a 30
11 b 25
12 a 30
12 b 25
13 a 27
As you can see, there were 2 rows for date 11, so both are copied to 12. I wrote this program for the problem:
missing = [12]
for i in missing:
    new_date = i
    i -= 1  # go to the previous date
    k = df[df["Date"] == i].index.tolist()[-1] + 1  # position at which to insert
    data = pd.DataFrame(df[df["Date"] == i].values, columns=df.columns)
    data["Date"] = new_date
    df = pd.concat([df.iloc[:k], data, df.iloc[k:]]).reset_index(drop=True)
For a large data set this program takes a lot of time, because it has to find the index and concatenate three DataFrames on every iteration. Is there a better, more efficient way to solve this problem?
Answer 1:
I don't think there is a way to pad just the "middle" values, but here's a way to do it (using ffill, bfill and fillna):
In [11]: df1 # assuming Date is the index via df.set_index("Date")
Out[11]:
cost NC
Date
20 5 NaN
21 7 NaN
23 9 78.0
25 6 80.0
In [12]: df2 = df1.reindex(np.arange(20,27))
# np.arange(20, 26) would be sufficient, but the extra row 26 lets us see it working!
In [13]: df2
Out[13]:
cost NC
Date
20 5.0 NaN
21 7.0 NaN
22 NaN NaN
23 9.0 78.0
24 NaN NaN
25 6.0 80.0
26 NaN NaN
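You don't want to fill in the "outside" NaNs. The cells that are not "outside" (those with a valid value somewhere both above and below them in the column) can be found with: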
In [14]: df2.bfill().notnull() & df2.ffill().notnull()
Out[14]:
cost NC
Date
20 True False
21 True False
22 True False
23 True True
24 True True
25 True True
26 False False
Now we can update just these cells with the values that fillna would give them:
In [15]: df2[df2.bfill().notnull() & df2.ffill().notnull()] = df2.fillna(0) # x = 0
In [16]: df2
Out[16]:
cost NC
Date
20 5.0 NaN
21 7.0 NaN
22 0.0 NaN
23 9.0 78.0
24 0.0 0.0
25 6.0 80.0
26 NaN NaN
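The steps above can be collected into one small helper. This is only a sketch: the function name fill_interior is made up for illustration, and it uses DataFrame.where rather than the boolean-mask assignment shown above, which should give the same result on this data.

```python
import numpy as np
import pandas as pd


def fill_interior(frame, value):
    """Fill only the NaNs that have a valid value both above and below them.

    Hypothetical helper name; not part of pandas.
    """
    # A cell is "interior" if both a forward fill and a back fill reach it.
    interior = frame.ffill().notna() & frame.bfill().notna()
    # Keep the original cell outside the interior, use the filled cell inside.
    return frame.where(~interior, frame.fillna(value))


df1 = pd.DataFrame(
    {"cost": [5, 7, 9, 6], "NC": [np.nan, np.nan, 78.0, 80.0]},
    index=pd.Index([20, 21, 23, 25], name="Date"),
)
df2 = df1.reindex(np.arange(20, 27))
result = fill_interior(df2, 0)  # x = 0
print(result)
```

Unlike the in-place assignment, this returns a new DataFrame and leaves df2 untouched, which may be easier to reuse when x changes.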
To (partially) answer the second question, IMO you're always better off in that situation to start with a pivot (this will give you a much better starting point):
In [21]: df
Out[21]:
Date type cost
0 10 a 30
1 11 a 30
2 11 b 25
3 13 a 27
In [22]: df.pivot_table("cost", "Date", "type")
Out[22]:
type a b
Date
10 30.0 NaN
11 30.0 25.0
13 27.0 NaN
Perhaps you are looking to fill forward from there? (and unstack if necessary).
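One possible way to continue from the pivot is sketched below. It assumes each (Date, type) pair occurs at most once (pivot_table would otherwise average duplicates) and that Date is an integer with a step of 1. It uses reindex with method="ffill", which copies the previous existing date's row into each newly added date without touching NaNs that were already there, and melt (my choice here, not something the answer prescribes) to get back to the long layout.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Date": [10, 11, 11, 13], "type": ["a", "a", "b", "a"], "cost": [30, 30, 25, 27]}
)

# One row per existing date, one column per type.
wide = df.pivot_table(values="cost", index="Date", columns="type")

# Add the missing dates; each new date receives the row of the previous
# existing date. NaNs already present (Date 13 has no "b") are left alone.
all_dates = pd.Index(np.arange(wide.index.min(), wide.index.max() + 1), name="Date")
filled = wide.reindex(all_dates, method="ffill")

# Back to long format, dropping the (Date, type) pairs that never existed.
result = (
    filled.reset_index()
    .melt(id_vars="Date", var_name="type", value_name="cost")
    .dropna(subset=["cost"])
    .sort_values(["Date", "type"])
    .reset_index(drop=True)
)
print(result)
```

This avoids the per-date concat loop from the question, since all missing dates are handled by a single reindex. Note that cost comes back as float because of the pivot.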
Source: https://stackoverflow.com/questions/37821653/filling-missing-middle-values-in-pandas-dataframe