pandas extract regex allowing mismatches

人盡茶涼 submitted on 2021-02-17 03:30:16

Question


Pandas has a very fast and nice string method, extract(). This method works perfectly with a regex such as this one:

strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"

test_df

    R1
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT

test_df.R1.str.extract(strict_pattern)

    pre_spacer  UMI           post_spacer
21  ACGAG       TTTTCGTATTTT  TGGAGTCT
22  ACGAG       TAGGGAGGGGGG  TGGAGTCT
23  ACGAG       GGGGGGGAGGC   TGGAGTCT
24  ACGAG       AATAACGTTTGG  TGGAGTCT
25  ACGAG       GGGAATAAATAT  TGGAGTCT
26  ACGAG       ATTGGGTATGC   TGGAGTCT
27  ACGAG       GTACCCGCGCCA  TGGAGTCT
28  ACGAG       TGGTTTTTGTCG  TGGAGTCT
29  ACGAG       ACGTGTCCACCA  TGGAGTCT

However, since extract() relies on Python's built-in re module rather than the regex package (if I'm not mistaken), it does not support patterns that allow mismatches, such as this one:

lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"

This regex allows one substitution in the pre_spacer and post_spacer sequences.
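
A quick check with the standard library illustrates why such a pattern fails quietly rather than with an error. This is a minimal sketch, assuming current CPython behaviour, where a brace expression that is not a valid repetition count is read as literal text; the names no_match and literal_match are only illustrative.

```python
import re

lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
seq = "ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG"

# re compiles the pattern without complaint, but it reads "{s<=1}" as the
# literal characters {, s, <, =, 1, } instead of a fuzzy-matching constraint.
compiled = re.compile(lax_pattern)
no_match = compiled.match(seq)  # None: the read does not contain "{s<=1}"

# Only a string that literally contains "{s<=1}" satisfies that part.
literal_match = re.match(r"^(?P<pre_spacer>ACGAG){s<=1}", "ACGAG{s<=1}")

print(no_match)
print(literal_match is not None)
```

Under that assumption, a pattern like this passed to extract() would likely produce rows of NaN rather than raise, which is easy to mistake for "no reads matched".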

As shown in this example, the regex package allows this kind of regex:

import regex

seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()

{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}

What I would like is to make extract() compatible with this kind of regex, or any fast workaround.

I have written the following, but it is 12 times slower than extract(), and I deal with very big dataframes.

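import numpy as np
import regex

def extract_regex(pattern, seq):
    # Return the three named groups, or NaNs when the sequence does not match.
    m = regex.match(pattern, seq)
    try:
        d = m.groupdict()
        return list(d.values())
    except AttributeError:  # m is None when there is no match
        return [np.nan] * 3

test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))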

test_df

    R1                              pre_spacer  UMI           post_spacer
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG  ACGAG       TTTTCGTATTTT  TGGAGTCT
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG  ACGAG       TAGGGAGGGGGG  TGGAGTCT
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT  ACGAG       GGGGGGGAGGC   TGGAGTCT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC  ACGAG       AATAACGTTTGG  TGGAGTCT
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC  ACGAG       GGGAATAAATAT  TGGAGTCT
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC  ACGAG       ATTGGGTATGC   TGGAGTCT
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG  ACGAG       GTACCCGCGCCA  TGGAGTCT
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA  ACGAG       TGGTTTTTGTCG  TGGAGTCT
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT  ACGAG       ACGTGTCCACCA  TGGAGTCT

Any ideas on how to tune the pandas extract() method, or how to write the desired function so that it runs at a similar speed?

Thanks in advance!

Pau.


Answer 1:


pandas' string methods are built on Python's standard re module, so until pandas supports the regex library you can't use these features in .extract.

You will probably have to rely on .apply with a custom method:

import regex
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})

lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")

# Returned for sequences that do not match; empty strings rather than NaN.
empty_val = pd.Series(["","",""], index=['pre_spacer','UMI','post_spacer'])

def extract_regex(seq):
    # The pattern is anchored with ^, so search() only matches at the start of seq.
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer','UMI','post_spacer'])
    else:
        return empty_val


test_df[["pre_spacer","UMI","post_spacer"]] = test_df['R1'].apply(extract_regex)

Output:

>>> test_df
                               R1 pre_spacer           UMI post_spacer
0  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG      ACGAG  TTTTCGTATTTT    TGGAGTCT
1                        AAAAGGGA                                     
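
If only substitutions need to be tolerated, one possible workaround (not part of the answer above, and only a sketch) is to expand each spacer into a plain alternation that the standard re module understands. The helper one_substitution and the name expanded_pattern below are hypothetical; the assumption is that "at most one substitution" can be spelled out as "the spacer with any single position replaced by a wildcard".

```python
import re

def one_substitution(seq):
    """Build a plain-re alternation matching seq with at most one substituted character."""
    # Each variant replaces one position with "." (any character); since "."
    # also matches the original character, the exact spacer is covered too.
    variants = [re.escape(seq[:i]) + "." + re.escape(seq[i + 1:]) for i in range(len(seq))]
    return "(?:" + "|".join(variants) + ")"

# Same three named groups as lax_pattern, written without the {s<=1} syntax.
expanded_pattern = "".join([
    r"^(?P<pre_spacer>", one_substitution("ACGAG"), ")",
    r"(?P<UMI>.{9,13})",
    r"(?P<post_spacer>", one_substitution("TGGAGTCT"), ")",
])
expanded = re.compile(expanded_pattern)

exact = expanded.match("ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG")
one_sub = expanded.match("ACGTGCGCCCACCCGCCTGGAGTCTACC")   # one substitution in pre_spacer
two_subs = expanded.match("TCGTGCGCCCACCCGCCTGGAGTCTACC")  # two substitutions: rejected

print(exact.groupdict())
print(one_sub.groupdict())
print(two_subs)
```

Because expanded_pattern is ordinary re syntax, it should be usable as test_df.R1.str.extract(expanded_pattern), keeping the speed of extract(). Two caveats: the sketch does not cover insertions or deletions, and when several UMI lengths are possible the split chosen by greedy matching may differ from what the regex package's fuzzy matching picks, so it is worth spot-checking on real reads.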


Source: https://stackoverflow.com/questions/57921051/pandas-extract-regex-allowing-mismatches
