Filter column with awk and regexp

谎友^ 2021-02-01 19:08

I have a pretty simple question. I have a file containing several columns and I want to filter them using awk.

So the column of interest is the 6th column and I want to fi

6 Answers
  •  温柔的废话
    2021-02-01 19:39

    I know this thread has already been answered, but I have a similar problem (relating to finding the operations that "consume query"). I'm trying to sum all of the integers preceding one of the characters 'S', 'M', 'I', '=', 'X', or 'H', in order to find the read length from a paired-end read's CIGAR string.

    I wrote a Python script that reads column $6 of a SAM/BAM file from standard input:

    import re                       # regular expression module
    import sys                      # getting standard input
    
    # Match every integer followed by one of the operations S, M, I, =, X, H
    # (101M, 1S, 70X, etc.).
    # Example inputs and outputs: 
    # "49M1S" produces total=50
    # "10M757N40M" produces total=50
    pattern = re.compile(r'(\d+)[SMI=XH]')
    
    # Each input line is the CIGAR string of one paired-end read.
    # read_id starts at 1 and complements the id from filter_1.txt.
    for read_id, line in enumerate(sys.stdin, start=1):
        total = sum(int(n) for n in pattern.findall(line))
        print(str(read_id) + ' ' + str(total))
    

    The purpose of read_id is to give each read a unique number, in case you want to print the read lengths beside awk-ed columns from a BAM file.
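If the goal is to print the read length beside the other columns, one option is to let Python split the line itself instead of receiving only column 6. The following is a minimal sketch, not the script above: it assumes tab-separated SAM text (for example, the text form of a BAM file) with the CIGAR string in column 6, and the names `read_length` and `annotate` are hypothetical helpers introduced here for illustration.

```python
import re

# Same operations as the script above: S, M, I, =, X, H.
CIGAR_OP = re.compile(r'(\d+)[SMI=XH]')


def read_length(cigar):
    """Sum the lengths of the counted CIGAR operations in one CIGAR string."""
    return sum(int(n) for n in CIGAR_OP.findall(cigar))


def annotate(lines):
    """Yield 'read_id<TAB>read_length<TAB>original line' per alignment line.

    Assumption: input is tab-separated SAM text with the CIGAR in column 6.
    Header lines starting with '@' and lines with fewer than 6 columns
    are skipped and do not receive a read_id.
    """
    read_id = 0
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('@'):
            continue
        fields = line.split('\t')
        if len(fields) < 6:
            continue
        read_id += 1
        yield '\t'.join([str(read_id), str(read_length(fields[5])), line])
```

To use it on piped input, something like `for row in annotate(sys.stdin): print(row)` (after `import sys`) should be enough. Because the read length is prepended to the untouched line, awk can still address the original columns, shifted by two.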

    I hope this helps, or at least helps the next user that has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.
