Removing an element from a list based on a predicate


I want to remove every element that contains 'X' or 'N' from a list. I have to apply this to a large genome. Here is an example:


9 answers

谎友^

2021-01-18 12:59
As S.Mark requested, here is my version. It's probably slower, but it makes it easier to change what gets removed.
```
def filter_genome(genome, killlist=set("X N".split())):
    # Keep a codon only if it shares no character with killlist (empty intersection).
    return [codon for codon in genome if 0 == len(set(codon) & killlist)]
```
无人及你

2021-01-18 13:01
If you're dealing with extremely large lists, you want to use methods that don't traverse the entire list any more than absolutely necessary.

Your best bet is likely to create a filter function and use itertools.ifilter (Python 2), e.g.:
```
# ifilter keeps the items for which the predicate is true, so test for absence.
new_seq = itertools.ifilter(lambda x: 'X' not in x and 'N' not in x, seq)
```
This defers testing each element until you actually iterate over the result. Note that you can filter a filtered sequence just as you can the original sequence:
```
new_seq1 = itertools.ifilter(some_other_predicate, new_seq)
```
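In Python 3, itertools.ifilter no longer exists and the built-in filter is itself lazy, so the same idea might be sketched as below. The names has_bad_base and is_short are hypothetical, chosen only for illustration, and the extra 'GGGG' entry is made-up sample data.

```python
from itertools import filterfalse

seq = ['AAT', 'XAC', 'ANT', 'TTA', 'GGGG']

def has_bad_base(codon):
    """Return True if the codon contains 'X' or 'N'."""
    return 'X' in codon or 'N' in codon

def is_short(codon):
    """Hypothetical second predicate: keep only codons of length 3."""
    return len(codon) == 3

# filterfalse keeps the items for which the predicate is false;
# no element is tested until the result is iterated.
clean = filterfalse(has_bad_base, seq)

# A lazy filter can be layered on top of another lazy filter.
clean_short = filter(is_short, clean)

# Only here is the input actually traversed, and only once.
result = list(clean_short)
print(result)
```

Both filters run in a single pass when list() consumes the chain, and no intermediate list is built.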
Edit:

Also, a little testing shows that memoizing found entries in a set is likely to provide enough of an improvement to be worth doing, and using a regular expression is probably not the way to go:
```
>>> import re, timeit
>>> seq = ['AAT','XAC','ANT','TTA']
>>> p = re.compile('[X|N]')  # as timed; '|' is literal inside [], so '[XN]' is the intended class
>>> timeit.timeit('[x for x in seq if not p.search(x)]', 'from __main__ import p, seq')
3.4722548536196314
>>> timeit.timeit('[x for x in seq if "X" not in x and "N" not in x]', 'from __main__ import seq')
1.0560532134670666
>>> s = set(('XAC', 'ANT'))
>>> timeit.timeit('[x for x in seq if x not in s]', 'from __main__ import s, seq')
0.87923730529996647
```
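To repeat this comparison on your own data, a small harness along the following lines may help. It is only a sketch: the function names and the number of repetitions are arbitrary choices, and the timings will differ from the figures above depending on the machine and Python version.

```python
import re
import timeit

seq = ['AAT', 'XAC', 'ANT', 'TTA']
pattern = re.compile('[XN]')
known_bad = {'XAC', 'ANT'}  # entries already found to contain X or N

def by_regex():
    return [x for x in seq if not pattern.search(x)]

def by_in():
    return [x for x in seq if 'X' not in x and 'N' not in x]

def by_lookup():
    return [x for x in seq if x not in known_bad]

candidates = {'regex': by_regex, 'in': by_in, 'set lookup': by_lookup}

# timeit.timeit accepts a callable; number is the repetition count.
timings = {name: timeit.timeit(fn, number=10000)
           for name, fn in candidates.items()}

# Print the approaches from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name:>10}: {seconds:.4f} s')
```

Replace seq with a representative slice of the real genome before drawing conclusions, since a four-element list says little about large inputs.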

臣服心动

2021-01-18 13:03

Another way. It is not the fastest, but I think it reads nicely:

```
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(y in x for y in "XN")]
['AAT', 'TTA']

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not set("XN")&set(x)]
['AAT', 'TTA']
```

This way will be faster for long codons (assuming there is some repetition), because each distinct string is only tested once:

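```
codon = ['AAT','XAC','ANT','TTA']
def pred(s, memo={}):
    # memo is a mutable default argument, so it persists across calls and acts as a cache
    if s not in memo:
        memo[s] = not any(y in s for y in "XN")
    return memo[s]

print filter(pred, codon)  # Python 2; in Python 3 use print(list(filter(pred, codon)))
```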

Here is the method suggested by James Brooks. You'd have to test to see which is faster for your data:

```
codon = ['AAT','XAC','ANT','TTA']
def pred(s, memo={}):
    if s not in memo:
        memo[s] = not set("XN") & set(s)
    return memo[s]

print filter(pred, codon)
```

For this sample codon, the version using sets is about 10% slower
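On Python 3, the same memoization can be sketched with functools.lru_cache instead of a mutable default argument. This swaps in a different caching mechanism from the one used above; the name is_clean and the repeated sample entries are assumptions for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: one entry per distinct codon
def is_clean(codon):
    """Return True if the codon contains neither 'X' nor 'N'."""
    return not any(base in codon for base in "XN")

# Sample data with repetition, which is where caching pays off.
codons = ['AAT', 'XAC', 'ANT', 'TTA', 'AAT', 'XAC']

kept = [c for c in codons if is_clean(c)]
info = is_clean.cache_info()

print(kept)
print(info.hits, info.misses)
```

cache_info() shows how many calls were answered from the cache, which gives a quick way to judge whether the data has enough repetition for memoization to be worthwhile.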


别那么骄傲

2021-01-18 13:05
For basic purposes:
```
>>> [x for x in ['AAT','XAC','ANT','TTA'] if "X" not in x and "N" not in x]
['AAT', 'TTA']
```
But if you have a huge amount of data, I suggest you use a dict or set.

And if you have many characters other than X and N, you can do this:
```
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(ch for ch in list(x) if ch in ["X","N","Y","Z","K","J"])]
['AAT', 'TTA']
```
NOTE: list(x) can be just x, and ["X","N","Y","Z","K","J"] can be just "XNYZKJ". Also see gnibbler's answer; he did the best one.
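When the set of unwanted characters grows, one possible sketch is to keep them in a frozenset and test each codon with isdisjoint. The names FORBIDDEN and keep, and the extra 'KJA' entry, are hypothetical and only serve the example.

```python
# Characters that disqualify a codon (taken from the list above).
FORBIDDEN = frozenset("XNYZKJ")

def keep(codon, forbidden=FORBIDDEN):
    """Return True if the codon shares no character with forbidden."""
    # isdisjoint accepts any iterable and stops at the first common element.
    return forbidden.isdisjoint(codon)

codons = ['AAT', 'XAC', 'ANT', 'TTA', 'KJA']
filtered = [c for c in codons if keep(c)]
print(filtered)
```

Each character lookup in the frozenset is a hash lookup, so the cost per codon does not grow with the number of forbidden characters.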

没有蜡笔的小新

2021-01-18 13:12

Any reason for duplicating the entire list? How about:

```
>>> def pred(item, haystack="XN"):
...     return any(needle in item for needle in haystack)
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> idx = 0
>>> while idx < len(lst):
...     if pred(lst[idx]):
...         del lst[idx]
...     else:
...         idx = idx + 1
...
>>> lst
['AAT', 'TTA']
```

I know that list comprehensions are all the rage these days, but if the list is long we don't want to duplicate it without any reason, right? You can take this to the next step and create a nice utility function:

```
>>> def remove_if(coll, predicate):
...     idx = len(coll) - 1
...     while idx >= 0:
...         if predicate(coll[idx]):
...             del coll[idx]
...         idx = idx - 1
...     return coll
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> remove_if(lst, pred)
['AAT', 'TTA']
>>> lst
['AAT', 'TTA']
```
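One caveat with the loops above: every del in the middle of a list shifts all later elements, so removing many items from a very long list can become quadratic. A possible alternative that still avoids copying the list is to compact it in a single pass. This is a sketch of a different technique, not the code from this answer, and remove_if_inplace is a hypothetical name.

```python
def pred(item, haystack="XN"):
    """Return True if item contains any character of haystack."""
    return any(needle in item for needle in haystack)

def remove_if_inplace(coll, predicate):
    """Remove items matching predicate from coll in one pass, in place."""
    write = 0  # next slot for an element we keep
    for item in coll:
        if not predicate(item):
            # write never overtakes the read position, so this is safe
            coll[write] = item
            write += 1
    # Drop the leftover tail in a single operation.
    del coll[write:]
    return coll

lst = ['AAT', 'XAC', 'ANT', 'TTA']
same = remove_if_inplace(lst, pred)
print(lst)
```

The list object is modified rather than replaced, so any other references to it see the filtered contents.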


感情败类

2021-01-18 13:14
It is (asymptotically) faster to use a regular expression than to search the same string several times for individual characters: with a regular expression, the sequence is read at most once (instead of twice when the letters are not found, as in gnibbler's original answer, for instance). With gnibbler's memoization, the regular expression approach reads:
```
import re
remove = re.compile('[XN]').search

codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]= not remove(s)
    return memo[s]

print filter(pred,codon)
```
This should be (asymptotically) faster than using the "in s" or the "set" checks (i.e., the code above should be faster for long enough strings s).

I originally thought that gnibbler's answer could be written in a faster and more compact way with dict.setdefault():
```
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    return memo.setdefault(s, not any(y in s for y in "XN"))

print filter(pred,codon)
```
However, as gnibbler noted, the value in setdefault is always evaluated (even though, in principle, it could be evaluated only when the dictionary key is not found).
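The difference is easy to observe by counting how often the test actually runs. The sketch below uses a hypothetical is_clean helper that records each call; it is only meant to illustrate the evaluation order, not to measure speed.

```python
calls = []  # records every time the test actually runs

def is_clean(s):
    calls.append(s)
    return not any(y in s for y in "XN")

memo = {}
memo.setdefault('AAT', is_clean('AAT'))  # key missing: test runs, result stored
memo.setdefault('AAT', is_clean('AAT'))  # key present: test still runs, result discarded
setdefault_calls = len(calls)

calls.clear()
memo.clear()

def pred(s):
    # Explicit membership check: the test runs only on a cache miss.
    if s not in memo:
        memo[s] = is_clean(s)
    return memo[s]

pred('AAT')
pred('AAT')
guarded_calls = len(calls)

print(setdefault_calls, guarded_calls)
```

Arguments are evaluated before the method is called, so setdefault cannot skip the computation; the explicit "if s not in memo" check can.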
