Filter column with awk and regexp

谎友^ 2021-02-01 19:08

I've a pretty simple question. I've a file containing several columns and I want to filter them using awk.

So the column of interest is the 6th column and I want to fi

6 answers

温柔的废话 (OP)

2021-02-01 19:39
I know this thread has already been answered, but I have a similar problem (relating to finding the CIGAR operations that "consume query"). I'm trying to sum all of the integers that precede one of the characters 'S', 'M', 'I', '=', 'X', or 'H', in order to find the read length from a paired-end read's CIGAR string.

I wrote a Python script that reads column $6 (the CIGAR string) of a SAM/BAM file from standard input:
```
import re                       # regular expression module
import sys                      # getting standard input

# Match every integer followed by one of the operations S, M, I, =, X, H,
# e.g. 101M, 1S, 70X. Other operations (such as N or D) are not counted.
# Example inputs and outputs:
# "49M1S" produces total=50
# "10M757N40M" produces total=50
CIGAR_OP = re.compile(r'(\d+)[SMI=XH]')

# One CIGAR string per input line; read_id starts at 1 and
# complements the id from filter_1.txt
for read_id, line in enumerate(sys.stdin, start=1):
    total = sum(int(n) for n in CIGAR_OP.findall(line))
    print(f'{read_id} {total}')
```
The purpose of read_id is to give each read a unique number, in case you want to print the read lengths beside awk-ed columns from a BAM file.
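If the goal is to print the length next to other columns, another option is to read the whole tab-separated SAM line instead of only column 6, so that no separate id is needed to join the results afterwards. The following is only a sketch under some assumptions: the names `read_length` and `annotate` are made up for illustration, the two sample alignment lines are invented, and the CIGAR string is assumed to sit in the 6th tab-separated field.

```python
import re

# Same operations as in the script above: S, M, I, =, X and H.
# H (hard clip) is kept so that hard-clipped bases count toward the length.
CIGAR_OP = re.compile(r'(\d+)[SMI=XH]')


def read_length(cigar):
    """Sum the integers that precede S, M, I, =, X or H in a CIGAR string."""
    return sum(int(n) for n in CIGAR_OP.findall(cigar))


def annotate(sam_lines):
    """Yield (read name, CIGAR, length) for each alignment line."""
    for line in sam_lines:
        if line.startswith('@'):          # skip SAM header lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 6:               # skip lines without a CIGAR column
            continue
        yield fields[0], fields[5], read_length(fields[5])


# Invented sample lines, for illustration only.
sample = [
    '@HD\tVN:1.6\n',
    'read1\t99\tchr1\t100\t60\t49M1S\t=\t300\t250\t*\t*\n',
    'read2\t147\tchr1\t300\t60\t10M757N40M\t=\t100\t-250\t*\t*\n',
]

for name, cigar, length in annotate(sample):
    print(name, cigar, length, sep='\t')
```

With real data, `sample` could be replaced by `sys.stdin` (after `import sys`) and the script fed with the text output of `samtools view`.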

I hope this helps, or at least helps the next user that has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.