Question
I have a TAB-delimited .txt file that looks like this.
Gene_name A B C D E F
Gene1 1 0 5 2 0 0
Gene2 4 45 0 0 32 1
Gene3 0 23 0 4 0 54
Gene4 12 0 6 8 7 4
Gene5 4 0 0 6 0 7
Gene6 0 6 8 0 0 5
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
I want to extract rows in which at least 3 columns have values > 0. How do I do this using pandas? I am clueless about how to apply conditions to data read from a .txt file.
Thanks very much!
Update: adding on to this question, how do I apply this condition to specific columns only? Say I look at columns A, C, E and F and want to extract rows in which at least 3 of these columns have values > 5.
Cheers!
Answer 1:
Piggybacking off @MaxU's solution, I like to go ahead and put 'Gene_name' into the index so I don't have to worry about all that index slicing:
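import pandas as pd
# sep=r'\s+' splits on any whitespace (TABs included); it replaces delim_whitespace=True, deprecated in pandas 2.2
df = pd.read_csv(tfile, sep=r'\s+', index_col=0)
df[df.gt(0).sum(axis=1).ge(3)]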
Edit for question update:
df[df[['A','C','E','F']].gt(5).sum(1).ge(3)]
Output:
            A   B   C    D   E   F
Gene_name
Gene4      12   0   6    8   7   4
Gene7      13  45  64  234   0   6
Gene8      11   6   0    7   7   9
Gene9       6   0  12   34   0  11
Gene10     23   4   6    7  89   0
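For anyone who wants to try this without the file on disk, here is a minimal, self-contained sketch. It assumes the question's data pasted into a string and read through io.StringIO; `rows_with_min_hits` and its parameter names are hypothetical, made up for illustration, and are not part of pandas.

```python
import io

import pandas as pd

# Sample data copied from the question (whitespace-separated here; for the
# real TAB-delimited file, pass its path to read_csv instead of a StringIO).
SAMPLE = """\
Gene_name A B C D E F
Gene1 1 0 5 2 0 0
Gene2 4 45 0 0 32 1
Gene3 0 23 0 4 0 54
Gene4 12 0 6 8 7 4
Gene5 4 0 0 6 0 7
Gene6 0 6 8 0 0 5
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
"""


def rows_with_min_hits(df, threshold, min_hits, columns=None):
    """Return rows where at least `min_hits` of `columns` exceed `threshold`.

    Hypothetical helper; `columns=None` means every column of `df` is checked.
    """
    subset = df if columns is None else df[list(columns)]
    hits = subset.gt(threshold).sum(axis=1)  # per-row count of cells > threshold
    return df[hits.ge(min_hits)]


# Gene_name goes into the index, so only numeric columns remain.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", index_col=0)

# Original question: at least 3 columns with values > 0.
at_least_3_positive = rows_with_min_hits(df, threshold=0, min_hits=3)

# Update: at least 3 of A, C, E, F with values > 5.
subset_hits = rows_with_min_hits(df, threshold=5, min_hits=3, columns=["A", "C", "E", "F"])
print(subset_hits)
```

With this sample every row passes the first filter, so only the second call actually narrows the frame down.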
Answer 2:
df = pd.read_csv(filename, sep=r'\s+')  # replaces delim_whitespace=True, deprecated in pandas 2.2
In [22]: df[df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)]
Out[22]:
   Gene_name   A   B   C    D   E   F
0      Gene1   1   0   5    2   0   0
1      Gene2   4  45   0    0  32   1
2      Gene3   0  23   0    4   0  54
3      Gene4  12   0   6    8   7   4
4      Gene5   4   0   0    6   0   7
5      Gene6   0   6   8    0   0   5
6      Gene7  13  45  64  234   0   6
7      Gene8  11   6   0    7   7   9
8      Gene9   6   0  12   34   0  11
9     Gene10  23   4   6    7  89   0
Some explanation: gt(0) marks each positive cell as True, and sum(axis=1) then counts the True values in each row:
In [25]: df.select_dtypes(['number']).gt(0)
Out[25]:
       A      B      C      D      E      F
0   True  False   True   True  False  False
1   True   True  False  False   True   True
2  False   True  False   True  False   True
3   True  False   True   True   True   True
4   True  False  False   True  False   True
5  False   True   True  False  False   True
6   True   True   True   True  False   True
7   True   True  False   True   True   True
8   True  False   True   True  False   True
9   True   True   True   True   True  False
In [26]: df.select_dtypes(['number']).gt(0).sum(axis=1)
Out[26]:
0 3
1 4
2 3
3 5
4 3
5 3
6 5
7 5
8 4
9 5
dtype: int64
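The same chain can be spelled out one step at a time. This is a small sketch, assuming a hand-built slice of the question's data (three genes, columns A to C only); the variable names are mine.

```python
import pandas as pd

# Three genes and three columns taken from the question's data.
df = pd.DataFrame(
    {
        "Gene_name": ["Gene1", "Gene4", "Gene7"],
        "A": [1, 12, 13],
        "B": [0, 0, 45],
        "C": [5, 6, 64],
    }
)

numeric = df.select_dtypes(["number"])  # drops the Gene_name text column
positive = numeric.gt(0)                # boolean frame: True where the cell is > 0
counts = positive.sum(axis=1)           # True counts as 1, so this is a per-row tally
keep = counts.ge(3)                     # boolean Series: True for rows to keep
result = df[keep]                       # boolean indexing keeps only the True rows

print(counts.tolist())
print(result)
```

Because the mask is computed on the numeric columns only but applied to the full frame, Gene_name is still present in the result.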
Answer 3:
Using operators (as a complement to Max's answer):
mask = (df.iloc[:, 1:] > 0).sum(1) >= 3
mask
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df[mask]
   Gene_name   A   B   C    D   E   F
0      Gene1   1   0   5    2   0   0
1      Gene2   4  45   0    0  32   1
2      Gene3   0  23   0    4   0  54
3      Gene4  12   0   6    8   7   4
4      Gene5   4   0   0    6   0   7
5      Gene6   0   6   8    0   0   5
6      Gene7  13  45  64  234   0   6
7      Gene8  11   6   0    7   7   9
8      Gene9   6   0  12   34   0  11
9     Gene10  23   4   6    7  89   0
Similarly, querying all rows with 5 or more positive values:
df[(df.iloc[:, 1:] > 0).sum(1) >= 5]
   Gene_name   A   B   C    D   E   F
3      Gene4  12   0   6    8   7   4
6      Gene7  13  45  64  234   0   6
7      Gene8  11   6   0    7   7   9
9     Gene10  23   4   6    7  89   0
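The operator form should also cover the update in the question when Gene_name is left as a regular column: pick the columns of interest by name before comparing. This is a sketch, assuming a hand-built four-row subset of the question's data.

```python
import pandas as pd

# Four genes from the question's data; Gene_name stays a regular column.
df = pd.DataFrame(
    {
        "Gene_name": ["Gene1", "Gene4", "Gene6", "Gene9"],
        "A": [1, 12, 0, 6],
        "B": [0, 0, 6, 0],
        "C": [5, 6, 8, 12],
        "D": [2, 8, 0, 34],
        "E": [0, 7, 0, 0],
        "F": [0, 4, 5, 11],
    }
)

cols = ["A", "C", "E", "F"]             # columns to inspect, as in the update
mask = (df[cols] > 5).sum(axis=1) >= 3  # at least 3 of them strictly above 5
selected = df[mask]

print(selected)
```

Selecting by name rather than with iloc[:, 1:] means the filter does not depend on where the columns sit in the file.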
Source: https://stackoverflow.com/questions/46329960/extract-specific-rows-based-on-the-set-cut-off-values-in-columns