Filter column with awk and regexp

谎友^ 2021-02-01 19:08

I have a pretty simple question. I have a file containing several columns and I want to filter them using awk.

So the column of interest is the 6th column and I want to fi

6 Answers
  • 2021-02-01 19:39

    I know this thread has already been answered, but I have a similar problem (finding the parts of a string that "consume query"). I'm trying to sum all of the integers preceding one of the characters 'S', 'M', 'I', '=', 'X', or 'H', so as to find the read length from a paired-end read's CIGAR string.

    I wrote a Python script that takes in the column $6 from a SAM/BAM file:

    import sys                      # getting standard input
    import re                       # regular expression module
    
    lines = sys.stdin.readlines()   # gets all CIGAR strings for each paired-end read
    total = 0
    read_id = 1                     # complements id from filter_1.txt
    
    # Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
    # Example inputs and outputs: 
    # "49M1S" produces total=50
    # "10M757N40M" produces total=50
    
    for line in lines:
        all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
        for n in all_ints:
            total += n
        print(str(read_id)+ ' ' + str(total))
        read_id += 1
        total = 0
    

    The purpose of read_id is to give each read a unique number, in case you want to print the read lengths beside awk-ed columns from a BAM file.
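
    The same summation can be wrapped in a function so it is easy to test on its own. This is only a sketch: the names read_length and number_reads are hypothetical, and the set of letters is taken as given from the script above.

```python
import re

# Counts followed by one of the letters used in the script above.
QUERY_OPS = re.compile(r"(\d+)[SMI=XH]")


def read_length(cigar: str) -> int:
    """Sum the counts in front of every matching CIGAR operation."""
    return sum(int(n) for n in QUERY_OPS.findall(cigar))


def number_reads(cigars):
    """Yield (read_id, length) pairs, numbering reads from 1 in input order."""
    for read_id, cigar in enumerate(cigars, start=1):
        yield read_id, read_length(cigar.strip())


if __name__ == "__main__":
    for read_id, length in number_reads(["49M1S", "10M757N40M"]):
        print(read_id, length)   # prints "1 50" then "2 50"
```

    Passing sys.stdin to number_reads would give the same per-line output as the original script.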

    I hope this helps, or at least helps the next user that has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.

  • 2021-02-01 19:45

    This should do the trick:

    awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file
    

    Regexplanation:

    ^                        # Match the start of the string
    (([1-9]|[1-9][0-9]|100)  # Match a single digit 1-9 or double digit 10-99 or 100
    [SM]                     # Character class matching the character S or M
    ){2}                     # Repeat everything in the parens twice
    $                        # Match the end of the string
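
    To see what the pattern accepts without an input file at hand, here is a small Python sketch. It assumes Python's re module reads this particular pattern the same way awk does, and the sample values are made up for illustration.

```python
import re

# Same pattern as the awk one-liner; fullmatch() plays the role of ^ and $.
PATTERN = re.compile(r"(([1-9]|[1-9][0-9]|100)[SM]){2}")


def matches(field: str) -> bool:
    """True when the whole field is two number+letter pairs, each number 1-100."""
    return PATTERN.fullmatch(field) is not None


samples = ["12S23M", "100M1S", "123S12M", "0S5M", "12S23Mx", "12S"]
results = {s: matches(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample:10} {ok}")
```

    Only the first two samples pass: 123 and 0 are outside 1-100, the trailing x breaks the end anchor, and 12S has only one pair.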
    

    You have quite a few issues with your statement:

    awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
    
    • == is the string comparison operator. The regex comparison operator is ~.
    • You don't quote regex strings (you never quote anything with single quotes in awk besides the script itself), and your script is missing the final (legal) single quote.
    • [0-9] is the character class for the digit characters, not a numeric range. It matches any single character in the class 0,1,2,3,4,5,6,7,8,9, not any numerical value inside a range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches a single character, either a 1 or a 0.
    • [SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
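
    The two character-class points are easy to check interactively. The sketch below uses Python rather than awk, on the assumption that both read these simple bracket expressions the same way.

```python
import re

# [1-100] is read as: the range '1'-'1', then '0', then '0' again.
# It therefore matches exactly one character, either '0' or '1'.
digit_class = re.compile(r"[1-100]")

# [S|M] is a class of three literal characters: 'S', '|' and 'M'.
pipe_class = re.compile(r"[S|M]")

matched_digits = [c for c in "0123456789" if digit_class.fullmatch(c)]
print(matched_digits)                      # ['0', '1']
print(bool(digit_class.fullmatch("57")))   # False: a class matches one character
print(bool(pipe_class.fullmatch("|")))     # True: the pipe is matched literally
print(bool(re.fullmatch(r"[SM]", "|")))    # False
```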

    Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?". If it does, the line gets printed, because when you don't give any action awk executes {print $0} by default.
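
    For readers more at home in Python, the condition{action} flow can be imitated roughly as follows. This is a sketch of the idea, not of how awk is implemented; condition and awk_like_filter are hypothetical names and the data lines are invented.

```python
import re

PATTERN = re.compile(r"(([1-9]|[1-9][0-9]|100)[SM]){2}")


def condition(fields):
    """True when a sixth field exists and matches the whole pattern."""
    return len(fields) >= 6 and PATTERN.fullmatch(fields[5]) is not None


def awk_like_filter(lines):
    """Yield each record whose condition holds, like awk's default {print $0}."""
    for line in lines:
        record = line.rstrip("\n")
        # Like awk's default field splitting: runs of whitespace separate fields.
        if condition(record.split()):
            yield record


data = ["a b c d e 12S23M\n", "a b c d e 123S12M\n", "a b c d\n"]
kept = list(awk_like_filter(data))
print(kept)   # ['a b c d e 12S23M']
```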

  • 2021-02-01 19:47

    Try this:

    awk '$6 ~ /^([1-9]|0[1-9]|[1-9][0-9]|100)[SM]([1-9]|0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt
    

    Because you did not say exactly how column 6 is formatted, the above is written to match values that look like '03M05S', '40S100M', or '3M5S', and is meant to exclude everything else. For instance, it will not match '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'.
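
    The examples in the previous paragraph can be checked mechanically. The sketch below is in Python (assumed to treat this pattern like awk does); it applies each numeric group exactly once and uses [SM] for the letters, and the helper name NUM is hypothetical.

```python
import re

# One number: 1-9, 01-09, 10-99, or 100.
NUM = r"([1-9]|0[1-9]|[1-9][0-9]|100)"
PATTERN = re.compile(rf"{NUM}[SM]{NUM}[SM]")

expected_match = ["03M05S", "40S100M", "3M5S"]
expected_reject = ["03F05S", "200M05S", "03M005S", "003M05S", "003M005S"]

matched = [s for s in expected_match + expected_reject if PATTERN.fullmatch(s)]
print(matched)   # ['03M05S', '40S100M', '3M5S']
```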

    If you can guarantee that the numbers in column 6 always have two digits for 1-99 and three digits only for exactly 100 (that is, exactly one leading zero below 10 and no leading zeros otherwise), the match becomes simpler. Use the above pattern but exclude single digits (remove the first [1-9] alternative), e.g.

    awk '$6 ~ /^(0[1-9]|[1-9][0-9]|100)[SM](0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt
    
  • 2021-02-01 19:48

    The way to write the script you posted:

    awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
    

    in awk so it will do what you SEEM to be trying to do is:

    awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt
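
    The numeric part of that pattern, [1-9][0-9]?|100, is short enough that it is worth confirming it covers exactly 1 to 100. A brute-force check in Python (a sketch; accepted_numbers is a hypothetical helper) could look like this:

```python
import re

NUMBER = re.compile(r"[1-9][0-9]?|100")


def accepted_numbers(pattern, limit=1000):
    """Return every integer below limit whose decimal form the pattern fully matches."""
    return [n for n in range(limit) if pattern.fullmatch(str(n))]


accepted = accepted_numbers(NUMBER)
print(accepted[0], accepted[-1], len(accepted))   # 1 100 100
print(bool(NUMBER.fullmatch("07")))               # False: no leading zeros
```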
    

    Post some sample input and expected output to help us help you more.

  • 2021-02-01 19:49

    I would do the regex check and the numeric validation as different steps. This code works with GNU awk:

    $ cat data
    a b c d e 132x123y
    a b c d e 123S12M
    a b c d e 12S23M
    a b c d e 12S23Mx
    

    We'd expect only the 3rd line to pass validation: the 1st and 4th lines do not fit the pattern, and the 2nd contains 123, which is over 100.

    $ gawk '
        match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
        1 <= m[1] && m[1] <= 100 && 
        1 <= m[2] && m[2] <= 100 {
            print
        }
    ' data
    a b c d e 12S23M
    

    For maintainability, you could encapsulate that into a function:

    gawk '
        function validate6() {
            return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
                    1<=m[1] && m[1]<=100 && 
                    1<=m[2] && m[2]<=100 );
        }
        validate6() {print}
    ' data
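
    The same two-step idea (check the shape with a regex, then check the range numerically) carries over to other languages. Here is a rough Python equivalent run on the same four data lines; validate and its low/high parameters are hypothetical names, not part of the gawk code.

```python
import re

# Step 1: shape only, 1-3 digits then S or M, twice.
SHAPE = re.compile(r"([0-9]{1,3})[SM]([0-9]{1,3})[SM]")


def validate(field: str, low: int = 1, high: int = 100) -> bool:
    """Step 2: every captured number must lie in [low, high]."""
    m = SHAPE.fullmatch(field)
    if m is None:
        return False
    return all(low <= int(group) <= high for group in m.groups())


rows = [
    "a b c d e 132x123y",
    "a b c d e 123S12M",
    "a b c d e 12S23M",
    "a b c d e 12S23Mx",
]
kept = [row for row in rows if validate(row.split()[5])]
print(kept)   # ['a b c d e 12S23M']
```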
    
  • 2021-02-01 19:59

    Regexes match characters, not numeric values. "A number from 1 to 100" cannot be written as a range; it has to be spelled out digit by digit, e.g. [1-9][0-9]?|100. The simpler thing to check for is "1-3 digits."

    You want something like this

    /[0-9]{1,3}[SM][0-9]{1,3}[SM]/
    

    Note that the character class [SM] doesn't contain the | alternation character. You would only need that if you were writing it as (S|M).
