How to find the count of consecutive same string values in a pandas dataframe?

Question

Assume that we have the following pandas dataframe:

import pandas as pd

df = pd.DataFrame({
    'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G','C>A','C>T','C>T','C>T'],
    'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG','TCT','ACA','TCA','TCA'],
    'start': [1000,2000,3000,4000,5000,6000,10000,20000,10000,2000,3000,4000]})

input:
   col1 col2  start
0   A>G  TCT   1000
1   C>T  ACA   2000
2   C>T  TCA   3000
3   G>T  TCA   4000
4   C>T  GCT   5000
5   A>G  ACT   6000
6   A>G  CTG  10000
7   A>G  ATG  20000
8   C>A  TCT  10000
9   C>T  ACA   2000
10  C>T  TCA   3000
11  C>T  TCA   4000

For every run of two or more consecutive identical values in col1, I want the value itself, the length of the run, and the difference between the start of the run's last row and the start of its first row:

output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000

Answer 1:

With a little setup, you can vectorise almost all of this using GroupBy.agg (only the diff step uses a Python lambda):

aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')], 
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}

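# cumulative count of value changes: every run of equal values gets its own key
grouper = df.col1.ne(df.col1.shift()).cumsum()

v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)
# keep only runs of more than one row
v[v['length'].gt(1)].reset_index(drop=True)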

  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
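On pandas 0.25 and later, the same idea can probably be written with named aggregation, which avoids the MultiIndex columns and the droplevel step. The following is a minimal sketch rather than a drop-in replacement: the helper name consecutive_runs and its parameters are made up for illustration, and the sample frame repeats the question's col1 and start columns.

```python
import pandas as pd

def consecutive_runs(df, value_col="col1", pos_col="start"):
    # hypothetical helper, not part of pandas
    # a new run starts wherever the value differs from the previous row
    key = df[value_col].ne(df[value_col].shift()).cumsum().rename("run")
    out = df.groupby(key).agg(
        type=(value_col, "first"),
        length=(value_col, "size"),
        first_pos=(pos_col, "first"),
        last_pos=(pos_col, "last"),
    )
    out["diff"] = (out["last_pos"] - out["first_pos"]).abs()
    # keep only runs of more than one row
    out = out[out["length"] > 1]
    return out[["type", "length", "diff"]].reset_index(drop=True)

df = pd.DataFrame({
    "col1": ["A>G", "C>T", "C>T", "G>T", "C>T", "A>G",
             "A>G", "A>G", "C>A", "C>T", "C>T", "C>T"],
    "start": [1000, 2000, 3000, 4000, 5000, 6000,
              10000, 20000, 10000, 2000, 3000, 4000],
})
result = consecutive_runs(df)
print(result)
```

Filtering on length rather than on a non-zero diff keeps runs whose first and last start happen to be equal.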

Answer 2:

Probably something like the following, which walks over the runs with itertools.groupby:

import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G','C>T','C>T','C>T'],
    'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG','ACA','TCA','TCA'], 
    'start':[1000,2000,3000,4000,5000,6000,10000,20000,2000,3000,4000]})

final = []
pos = 0
# itertools.groupby yields one (value, run) pair per run of equal adjacent values
for k, g in groupby(df['col1']):
    run_len = len(list(g))
    first_pos = pos
    last_pos = pos + run_len - 1
    if run_len > 1:
        first = df['start'].iloc[first_pos]
        last = df['start'].iloc[last_pos]
        final.append({'type': k, 'length': run_len, 'diff': last - first})
    pos = last_pos + 1
final = pd.DataFrame(final)
print(final)

output:

  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000

Answer 3:

Here is a two-step solution: first create an auxiliary column that labels consecutive occurrences of the same string, then use a standard pandas groupby:

import numpy as np

# add a group variable
values = df['col1'].values
# get locations where value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label

# do the aggregation
res = (df
 .groupby('group')
 .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
 .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
)
# filter on more than one consecutive value
res = res[res['length'] > 1]

print(res)

        diff type  length
group                    
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
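To see why the cumulative sum of the change flags gives a usable group label, it may help to look at the intermediate arrays on a short input. This is only an illustration; the sample values below are made up and are not taken from the question.

```python
import numpy as np

values = np.array(["A", "B", "B", "C", "C", "C"])  # made-up sample values

# True wherever an element differs from its predecessor;
# the first element has no predecessor and stays False
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]

# each True bumps the running total, so all members of a run share one label
labels = change.cumsum()
print(change.tolist())  # [False, True, False, True, False, False]
print(labels.tolist())  # [0, 1, 1, 2, 2, 2]
```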

Answer 4:

You can use pandas groupby and more_itertools:

import pandas as pd
import more_itertools as mit

def f(g):
    # g holds all rows that share one col1 value; runs of consecutive index
    # labels in g are consecutive rows in df (assumes the default integer index)
    rows = []
    tp = g['col1'].iloc[0]
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue
        rows.append({'type': tp, 'length': len(group),
                     'diff': g.loc[group[-1], 'start'] - g.loc[group[0], 'start']})
    return pd.DataFrame(rows, columns=['type', 'length', 'diff'])

# rows come out ordered by col1 value, not by position in df
df.groupby('col1').apply(f).reset_index(drop=True)
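The approach above relies on the fact that, after grouping by col1, rows that were adjacent in the original frame still carry adjacent index labels. Splitting a list of integers into consecutive runs can be sketched with the standard library alone. The helper name split_consecutive is hypothetical and is only meant to illustrate the idea behind mit.consecutive_groups, not to reproduce its exact implementation.

```python
from itertools import groupby

def split_consecutive(indices):
    # hypothetical helper: within a run of consecutive integers,
    # value minus position stays constant, so it can serve as the grouping key
    runs = []
    for _, grp in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        runs.append([value for _, value in grp])
    return runs

# index labels of the 'C>T' rows in the question's input
print(split_consecutive([1, 2, 4, 9, 10, 11]))  # [[1, 2], [4], [9, 10, 11]]
```

Runs of length one, such as [4] here, are the ones that f skips.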

Source: https://stackoverflow.com/questions/53383208/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe

Tags

python

dataframe