问题
I have a dataframe like the following:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total
-----------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5
and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:
def func(row, n):
"""Returns the number of events before the
first sequence of 0s of length n is found
"""
idx = np.arange(0, 91)
a = row[idx]
b = (a != 0).cumsum()
c = b[a == 0]
d = c.groupby(c).count()
#in case there is no sequence of 0s with length n
try:
e = c[c >= d.index[d >= n][0]]
f = str(e.index[0])
except IndexError:
e = [90]
f = str(e[0])
idx_sliced = np.arange(0, int(f)+1)
a = row[idx_sliced]
if (int(f) + n > 90):
perc_before = 100
else:
perc_before = a.cumsum().tail(1).values[0]/row['total']
return perc_before
As is, the error I get is:
---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')
Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total %_before
---------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 43
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 21
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 90
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 100
If trying to solve this, you can test by using this sample input:
a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])
df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()
回答1:
Give this a try:
def percent_before(row, n, ncols):
"""Return the percentage of activities happen before
the first sequence of at least `n` consecutive 0s
"""
start_index, i, size = 0, 0, 0
for i in range(ncols):
if row[i] == 0:
# increase the size of the island
size += 1
elif size >= n:
# found the island we want
break
else:
# start a new island
# row[start_index] is always non-zero
start_index = i
size = 0
if size < n:
# didn't find the island we want
return 1
else:
# get the sum of activities that happen
# before the island
idx = np.arange(0, start_index + 1).astype(str)
return row.loc[idx].sum() / row['total']
df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
Result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 total percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 33 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 53 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 45 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 99 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 87 0.172414
For the full frame, call apply
with ncols=91
.
回答2:
Another possible solution:
def get_vals(df, n):
df, out = df.T, []
for col in df.columns:
diff_to_previous = df[col] != df[col].shift(1)
g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])
vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
if len(vals):
out.append( df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum() )
else:
out.append( 1.0 )
return out
df['percent_before'] = get_vals(df, n=3)
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 0.172414
回答3:
As one of the comment of the previous question was about the speed, I guess you can try to vectorize the problem. I used this dataframe to try (slightly different than your original input):
ID 0 1 2 3 4 5 6 7 8 total
0 A 2 21 0 18 3 0 0 0 2 46
1 B 0 0 12 2 0 8 14 23 0 59
2 C 0 38 19 3 1 3 3 7 1 75
3 D 3 0 0 1 0 0 0 0 0 4
Now what I think is chaining command to create a mask and find where the data is not equal to 0, then use cumsum
along the column axis and see where the diff
along the column is equal to 0. To find the first one, you can use cummax
so that all the columns after (row-wise) are considered are True
. Mask the original dataframe with the opposite of this mask, sum along the columns and divide by total. for example with n=2:
n=2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
.eq(0).cummax(axis=1))].sum(axis=1)/df.total
print (df)
ID 0 1 2 3 4 5 6 7 8 total %_before
0 A 2 21 0 18 3 0 0 0 2 46 0.956522
1 B 0 0 12 2 0 8 14 23 0 59 0.000000
2 C 0 38 19 3 1 3 3 7 1 75 1.000000
3 D 3 0 0 1 0 0 0 0 0 4 0.750000
In your case, you need to change range(9)
by range(91)
to get all your columns
回答4:
You can do this using the rolling
method.
For your example input, given the number of zeros is 5, we can use
df.rolling(window=5, axis=1).apply(lambda x : np.sum(x))
The output would like
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 NaN NaN NaN NaN 15.0 14.0 17.0 4.0 4.0 4.0 4.0 0.0 12.0 13.0
1 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 32.0 28.0 16.0 20.0
2 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 18.0 18.0 23.0 19.0 12.0 11.0
3 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 33.0 79.0 67.0 66.0
4 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 16.0 16.0 16.0 66.0 54.0 55.0
14
0 14.0
1 9.0
2 12.0
3 55.0
4 56.0
Looking at the output, its very easy to see that in the first row, for column 11, since the value is 0, it means that starting at position 7, you have 5 zeros. Since none of the other rows have 0 in them, it means that none of the other rows have 5 contiguous zeros in them.
来源:https://stackoverflow.com/questions/59618058/percentage-of-events-before-and-after-a-sequence-of-zeros-in-pandas-rows