Question
Assume that we have the following pandas dataframe:
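import pandas as pd
df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>A', 'C>T', 'C>T', 'C>T'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG', 'TCT', 'ACA', 'TCA', 'TCA'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 10000, 2000, 3000, 4000]})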
input:
   col1 col2  start
0   A>G  TCT   1000
1   C>T  ACA   2000
2   C>T  TCA   3000
3   G>T  TCA   4000
4   C>T  GCT   5000
5   A>G  ACT   6000
6   A>G  CTG  10000
7   A>G  ATG  20000
8   C>A  TCT  10000
9   C>T  ACA   2000
10  C>T  TCA   3000
11  C>T  TCA   4000
For every run of consecutive identical values in col1 (ignoring runs of length one), I want the value itself, the length of the run, and the difference between the start of the run's last row and the start of its first row:
output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
Answer 1:
With a little setup, you can vectorise this using GroupBy.agg:
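import pandas as pd

# Output name and aggregation for each source column
aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')],
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}
# Label each run: the counter increases whenever col1 differs from the previous row
grouper = df.col1.ne(df.col1.shift()).cumsum()
v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)
# Single-row runs have diff == 0, so this filter drops them
# (a longer run whose first and last start are equal would be dropped too)
v[v['diff'].ne(0)].reset_index(drop=True)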
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
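The run-labelling step is the core of this answer, so a compact variant may help. The following is only a sketch, assuming pandas 0.25 or later (for named aggregation); the helper name `summarise_runs` and the `run_id` label are illustrative and not part of the answer. It filters on run length instead of on a non-zero diff and takes last minus first without `abs`.

```python
import pandas as pd

def summarise_runs(df):
    """Summarise runs of identical consecutive col1 values (hypothetical helper)."""
    # A new run starts wherever col1 differs from the previous row;
    # the cumulative sum of those change flags gives each run an integer id.
    run_id = df['col1'].ne(df['col1'].shift()).cumsum().rename('run_id')
    out = df.groupby(run_id).agg(
        type=('col1', 'first'),
        length=('col1', 'size'),
        diff=('start', lambda s: s.iloc[-1] - s.iloc[0]),
    )
    # Keep only runs with more than one row.
    return out[out['length'] > 1].reset_index(drop=True)

# The 12 rows shown in the question's input table (col2 is not needed here).
df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G',
             'A>G', 'A>G', 'C>A', 'C>T', 'C>T', 'C>T'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000,
              10000, 20000, 10000, 2000, 3000, 4000],
})
result = summarise_runs(df)
print(result)
```

Filtering on `length` keeps a run whose first and last `start` happen to be equal, which a filter on `diff` would discard.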
Answer 2:
Probably something like the following:
import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>T', 'C>T', 'C>T'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG', 'ACA', 'TCA', 'TCA'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 2000, 3000, 4000]})

final = []
pos = 0  # row position where the current run starts
# itertools.groupby yields one group per run of identical adjacent values
for k, g in groupby(df['col1']):
    glist = list(g)
    first_pos = pos
    last_pos = pos + len(glist) - 1
    if len(glist) > 1:  # keep only runs longer than one row
        first = df.iloc[first_pos].start
        last = df.iloc[last_pos].start
        final.append({'type': k, 'length': len(glist), 'diff': last - first})
    pos = last_pos + 1
final = pd.DataFrame(final)
print(final)
output:
    diff  length type
0   1000       2  C>T
1  14000       3  A>G
2   2000       3  C>T
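The position bookkeeping in this answer can be avoided by grouping the (type, start) pairs directly. This is a minimal sketch of the same itertools.groupby idea on plain lists; the helper name `consecutive_runs` is made up for illustration, and the data is the 11-row frame from this answer.

```python
from itertools import groupby

def consecutive_runs(types, starts):
    """Return (type, length, diff) for each run longer than one (hypothetical helper)."""
    runs = []
    # groupby only merges adjacent equal keys, which is exactly a "run".
    for key, grp in groupby(zip(types, starts), key=lambda pair: pair[0]):
        run_starts = [start for _, start in grp]
        if len(run_starts) > 1:
            runs.append((key, len(run_starts), run_starts[-1] - run_starts[0]))
    return runs

types = ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>T', 'C>T', 'C>T']
starts = [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 2000, 3000, 4000]
runs = consecutive_runs(types, starts)
print(runs)
```

With a DataFrame, the same helper could be called as `consecutive_runs(df['col1'], df['start'])`, and the resulting tuples passed to `pd.DataFrame(runs, columns=['type', 'length', 'diff'])`.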
Answer 3:
Here is a two-step solution: first create an auxiliary column that labels consecutive occurrences of the same string, then use a standard pandas groupby:
import numpy as np

# add a group variable
values = df['col1'].values
# get locations where the value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label
# do the aggregation
res = (df
       .groupby('group')
       .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
       .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
       )
# keep only runs of more than one consecutive value
res = res[res['length'] > 1]
print(res)
        diff type  length
group
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
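To see what the change-point labelling produces, it can be traced on a short array. The sketch below uses NumPy only, on the first eight rows of the data. The second half, which derives run lengths and diffs from the change points without a groupby, is an illustrative extension and not part of the answer.

```python
import numpy as np

values = np.array(['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G'])
starts = np.array([1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000])

# True at every position whose value differs from the previous one.
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
labels = change.cumsum()  # one integer label per run

# Illustrative extension: first index and length of each run.
first_idx = np.flatnonzero(np.r_[True, change[1:]])
lengths = np.diff(np.r_[first_idx, values.size])
last_idx = first_idx + lengths - 1
diffs = starts[last_idx] - starts[first_idx]  # last start minus first start

keep = lengths > 1  # runs with more than one row
print(labels)
print(values[first_idx][keep], lengths[keep], diffs[keep])
```

Note that this takes last minus first, whereas the answer uses max minus min; the two agree only when start is non-decreasing within each run.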
Answer 4:
You can use pandas groupby and more_itertools:
import pandas as pd
import more_itertools as mit

def f(g):
    # g holds every row of one col1 value; runs of consecutive index
    # labels within g correspond to consecutive rows of the original frame
    rows = []
    tp = g['col1'].iloc[0]
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue
        rows.append({'type': tp, 'length': len(group),
                     'diff': g.loc[group[-1], 'start'] - g.loc[group[0], 'start']})
    # a list (not a set) keeps the column order fixed
    return pd.DataFrame(rows, columns=['type', 'length', 'diff'])

df.groupby('col1').apply(f).reset_index(drop=True)
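This approach relies on the frame having a default integer index, so that consecutive index labels mean adjacent rows. As a rough illustration of what splitting into consecutive groups means, here is a standard-library stand-in built on itertools.groupby; the function name `consecutive_index_groups` is hypothetical, and the indices are the C>T rows of the 12-row input table in the question.

```python
from itertools import groupby

def consecutive_index_groups(indices):
    """Split sorted integers into runs of consecutive values (stdlib stand-in)."""
    groups = []
    # position - value is constant within a run of consecutive integers.
    for _, grp in groupby(enumerate(indices), key=lambda pair: pair[0] - pair[1]):
        groups.append([value for _, value in grp])
    return groups

c_to_t_rows = [1, 2, 4, 9, 10, 11]  # index labels where col1 == 'C>T'
groups = consecutive_index_groups(c_to_t_rows)
print(groups)
```

Because the outer groupby is on col1, the result is ordered by type rather than by order of appearance in the frame.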
Source: https://stackoverflow.com/questions/53383208/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe