I've a pretty simple question. I've a file containing several columns and I want to filter them using awk.
So the column of interest is the 6th column and I want to fi
I know this thread has already been answered, but I have a similar problem (relating to finding the CIGAR operations that "consume query"). I'm trying to sum all of the integers that precede one of the characters 'S', 'M', 'I', '=', 'X' or 'H', so as to find the read length from a paired-end read's CIGAR string.
I wrote a Python script that reads column 6 ($6 in awk) of a SAM/BAM file from standard input, one CIGAR string per line:
import re   # regular expression module
import sys  # getting standard input

# Match the integer in front of each operation that counts towards the read
# length, e.g. 101M, 1S, 70X. H (hard clip) is counted as well, so the total
# is the original read length rather than the length of SEQ.
# Example inputs and outputs:
#   "49M1S"      produces total=50
#   "10M757N40M" produces total=50
pattern = re.compile(r'(\d+)[SMI=XH]')

read_id = 1  # complements id from filter_1.txt
for line in sys.stdin:  # one CIGAR string per paired-end read
    total = sum(int(n) for n in pattern.findall(line))
    print(str(read_id) + ' ' + str(total))
    read_id += 1
The purpose of read_id is to give each read a unique number, in case you want to print the read lengths beside awk-ed columns from a BAM file.
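If the goal is to see the lengths next to other columns, one option is to read whole SAM lines instead of only column 6, so no separate id is needed to join things up afterwards. The following is only a rough sketch of that idea, not part of the script above: the names read_length, annotate and include_hard_clips are made up for illustration, and it assumes tab-separated SAM text with QNAME in column 1 and CIGAR in column 6.

```python
import re
import sys

# One (length, operation) pair per CIGAR element, e.g. "49M1S" -> 49 M, 1 S.
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')


def read_length(cigar, include_hard_clips=True):
    """Sum the CIGAR operations that belong to the read.

    With include_hard_clips=False the H operations are left out, which
    gives the length of SEQ instead of the original read length.
    A CIGAR of "*" (not available) gives 0.
    """
    ops = 'SMI=XH' if include_hard_clips else 'SMI=X'
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in ops)


def annotate(sam_lines):
    """Yield (QNAME, CIGAR, read length) for each alignment line."""
    for line in sam_lines:
        if line.startswith('@'):  # skip SAM header lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 6:  # not an alignment line, ignore it
            continue
        qname, cigar = fields[0], fields[5]
        yield qname, cigar, read_length(cigar)


def main():
    for qname, cigar, length in annotate(sys.stdin):
        print(qname, cigar, length, sep='\t')


if __name__ == '__main__':
    main()
```

Saved under a name of your choice (read_lengths.py here is just a placeholder, as is input.bam), it could be run as `samtools view input.bam | python read_lengths.py`, and the output can still be filtered with awk afterwards.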
I hope this helps, or at least helps the next user that has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.