I'm dealing with a specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect.
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P
does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P
option only seems to be available in Linux. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W
only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*)
before the optional (-W.*)?
gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz
and possible -W*
.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz} | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trainling -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*).raw.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
来源:https://stackoverflow.com/questions/4450034/matching-a-specific-substring-with-regular-expressions-using-awk