Question
Pandas has a very fast and nice string method, extract(). This method works perfectly with a regex such as this one:
strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
test_df
R1
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT
test_df.R1.str.extract(strict_pattern)
pre_spacer UMI post_spacer
21 ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAG ACGTGTCCACCA TGGAGTCT
However, since pandas uses the standard re module rather than the regex package (if I'm not mistaken), extract() does not support patterns that allow mismatches, such as this one:
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
This regex allows one substitution in the pre_spacer and post_spacer sequences.
As this example shows, the regex package accepts this kind of pattern:
import regex
seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()
{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}
What I would like is to make extract() compatible with this kind of regex, or any fast workaround.
I wrote the following, but it is about 12 times slower than extract(), and I work with very large DataFrames.
def extract_regex(pattern, seq):
    m = regex.match(pattern, seq)
    try:
        d = m.groupdict()
        return list(d.values())
    except AttributeError:
        return [np.nan] * 3
test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))
test_df
R1 pre_spacer UMI post_spacer
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT ACGAG ACGTGTCCACCA TGGAGTCT
Any ideas on how to tune the pandas extract() method, or how to write an equivalent function that runs at a similar speed?
Thanks in advance!
Pau.
Answer 1:
The pandas string methods are built on the standard re module, so until pandas supports the regex library, you can't use these fuzzy-matching features in .extract.
You will probably have to rely on .apply with a custom method:
import regex
import pandas as pd
test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})
# Compile once so the pattern is not re-parsed for every row.
lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")
# Returned for rows that do not match.
empty_val = pd.Series(["", "", ""], index=['pre_spacer', 'UMI', 'post_spacer'])
def extract_regex(seq):
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer', 'UMI', 'post_spacer'])
    else:
        return empty_val
test_df[["pre_spacer", "UMI", "post_spacer"]] = test_df['R1'].apply(extract_regex)
Output:
>>> test_df
R1 pre_spacer UMI post_spacer
0 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
1 AAAAGGGA
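If the speed of .extract matters more than the generality of the regex package, one possible workaround is to expand each fuzzy spacer into a plain alternation that the standard re module understands. The sketch below is not part of the original answer: the helper names one_substitution and build_lax_pattern are hypothetical, and it only covers the "at most one substitution" case ({s<=1}), not insertions or deletions. In ambiguous inputs the chosen UMI length may differ from what regex would pick.

```python
import re

def one_substitution(literal: str) -> str:
    """Build a plain-re alternation matching `literal` with at most one substituted character.

    Hypothetical helper: each variant replaces one position with '.', so the
    exact literal is matched as well (by any of the variants).
    """
    variants = [
        re.escape(literal[:i]) + "." + re.escape(literal[i + 1:])
        for i in range(len(literal))
    ]
    return "(?:" + "|".join(variants) + ")"

def build_lax_pattern(pre: str = "ACGAG", post: str = "TGGAGTCT") -> str:
    """Assemble a pattern with the same named groups as the fuzzy one in the question."""
    return (
        rf"^(?P<pre_spacer>{one_substitution(pre)})"
        r"(?P<UMI>.{9,13})"
        rf"(?P<post_spacer>{one_substitution(post)})"
    )

expanded_pattern = build_lax_pattern()

# One substitution in the pre-spacer (ACGAG -> ACCAG) still matches.
m = re.match(expanded_pattern, "ACCAGTTTTCGTATTTTTGGAGTCTTGTGG")
print(m.groupdict())
```

Because expanded_pattern is an ordinary re pattern with named groups, it should be usable directly as test_df.R1.str.extract(expanded_pattern), which keeps the work inside pandas instead of a Python-level .apply. The alternation grows linearly with the spacer length for one substitution, so it stays small for short spacers like these; allowing more mismatches this way would grow the pattern quickly.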
Source: https://stackoverflow.com/questions/57921051/pandas-extract-regex-allowing-mismatches